Regular Expression Matching: The Virtual Machine Approach (2009)

tosh1 pts0 comments

Regular Expression Matching: the Virtual Machine Approach

Regular Expression Matching: the Virtual Machine Approach

Russ Cox

rsc@swtch.com

December 2009

Introduction

Name the most widely used bytecode interpreter or virtual machine.<br>Sun's JVM? Adobe's Flash? .NET and Mono? Perl? Python? PHP?<br>These are all certainly popular, but there's one<br>more widely used than all those combined.<br>That bytecode interpreter is Henry Spencer's regular expression<br>library and its many descendants.

The first article in this series described<br>the two main strategies for implementing regular expression matching:<br>the worst-case linear-time NFA- and DFA-based<br>strategies used in awk and egrep (and now most greps),<br>and the worst-case exponential-time backtracking strategy used almost<br>everywhere else, including ed, sed, Perl, PCRE, and Python.

This article presents two strategies as two<br>different ways to implement a virtual machine<br>that executes a regular expression that has been<br>compiled into text-matching bytecodes,<br>just like .NET and Mono are different ways to implement a virtual machine<br>that executes a program that has been compiled into<br>CLI bytecodes.

Viewing regular expression matching as executing<br>a special machine makes it possible to add new features<br>just by adding (and implementing!) new machine instructions.<br>In particular, we can add regular expression<br>submatching instructions, so that<br>after matching (a+)(b+)<br>against aabbbb, a program can find out that<br>the parenthesized (a+)<br>(often referred to as \1 or $1)<br>matched aa and that (b+) matched bbbb.<br>Submatching can be implemented in both backtracking<br>and non-backtracking VMs.<br>(Code doing this dates back to 1985, but I believe this article is the first<br>written explanation of it.)

A Regular Expression Virtual Machine

To start, we'll define a regular expression virtual machine<br>(think Java VM).<br>The VM executes one or more threads, each running<br>a regular expression program, which is just a list of<br>regular expression instructions.<br>Each thread maintains two registers while it runs: a program counter (PC)<br>and a string pointer (SP).

The regular expression instructions are:

char c<br>If the character SP points at is not c, stop this thread: it failed.<br>Otherwise, advance SP to the next character<br>and advance PC to the next instruction.<br>match<br>Stop this thread: it found a match.<br>jmp x<br>Jump to (set the PC to point at) the instruction at x.<br>split x, y<br>Split execution: continue at both x and y.<br>Create a new thread with SP copied from the<br>current thread. One thread continues with PC x.<br>The other continues with PC y.<br>(Like a simultaneous jump to both locations.)

The VM starts with a single thread running with its<br>PC pointing at the beginning of the program and its SP pointing<br>at the beginning of the input string.<br>To run a thread, the VM executes the instruction<br>that the thread's PC points at; executing the instruction changes<br>the thread's PC to point at the next instruction to run. Repeat until<br>an instruction (a failed char or a match) stops the thread.<br>The regular expression matches a string if any thread finds a match.

Compiling a regular expression into byte code proceeds<br>recursively depending on the form of the regular expression.<br>Recall from the previous article<br>that regular expressions come<br>in four forms: a single letter like a,<br>a concatenation e1e2,<br>an alternation e1|e2,<br>or a repetition e? (zero or one),<br>e* (zero or more),<br>or<br>e+ (one or more).

A single letter a compiles into the single<br>instruction char a.<br>A concatenation concatenates the compiled form of the<br>two subexpressions.<br>An alternation uses a split to allow either<br>choice to succeed.<br>A zero-or-one repetition e? uses a split to<br>compile like an alternation with the empty string.<br>The zero-or-more repetition e* and the<br>one-or-more repetition e+ use a split<br>to choose whether to match e or break out of the repetition.

The exact code sequences are:

char a

e1e2<br>codes for e1

codes for e2

e1|e2<br>split L1, L2

L1: codes for e1

jmp L3

L2: codes for e2

L3:

e?<br>split L1, L2

L1: codes for e

L2:

e*<br>L1: split L2, L3

L2: codes for e

jmp L1

L3:

e+<br>L1: codes for e

split L1, L3

L3:

Once the entire regular expression has been compiled,<br>the generated code is finished with a final match<br>instruction.

As an example, the regular expression a+b+ compiles into

0 char a<br>1 split 0, 2<br>2 char b<br>3 split 2, 4<br>4 match

When run on aab, a VM implementation<br>might run the program this way:

Thread PC SP Execution

T1<br>0 char a<br>a ab<br>character matches

T1<br>1 split 0, 2<br>aa b<br>creates thread T2 at PC=2 SP=aa b

T1<br>0 char a<br>aa b<br>character matches

T1<br>1 split 0, 2<br>aa b<br>creates thread T3 at PC=2 SP=aab

T1<br>0 char a<br>aab<br>no match: thread T1 dies

T2<br>2 char b<br>aa b<br>no match: thread T2 dies

T3<br>2 char b<br>aab<br>character matches

T3<br>3 split 2, 4<br>abb<br>creates thread T4 at PC=4 SP=abb

T3<br>2 char b<br>abb<br>no match (end of string): thread T3 dies

T4<br>4 match<br>abb<br>match!

In this example, the implementation waits to run a new thread until<br>the current thread finishes,...

thread regular expression split char match

Related Articles