Regular Expression Matching: the Virtual Machine Approach
Regular Expression Matching: the Virtual Machine Approach
Russ Cox
rsc@swtch.com
December 2009
Introduction
Name the most widely used bytecode interpreter or virtual machine.<br>Sun's JVM? Adobe's Flash? .NET and Mono? Perl? Python? PHP?<br>These are all certainly popular, but there's one<br>more widely used than all those combined.<br>That bytecode interpreter is Henry Spencer's regular expression<br>library and its many descendants.
The first article in this series described<br>the two main strategies for implementing regular expression matching:<br>the worst-case linear-time NFA- and DFA-based<br>strategies used in awk and egrep (and now most greps),<br>and the worst-case exponential-time backtracking strategy used almost<br>everywhere else, including ed, sed, Perl, PCRE, and Python.
This article presents two strategies as two<br>different ways to implement a virtual machine<br>that executes a regular expression that has been<br>compiled into text-matching bytecodes,<br>just like .NET and Mono are different ways to implement a virtual machine<br>that executes a program that has been compiled into<br>CLI bytecodes.
Viewing regular expression matching as executing<br>a special machine makes it possible to add new features<br>just by adding (and implementing!) new machine instructions.<br>In particular, we can add regular expression<br>submatching instructions, so that<br>after matching (a+)(b+)<br>against aabbbb, a program can find out that<br>the parenthesized (a+)<br>(often referred to as \1 or $1)<br>matched aa and that (b+) matched bbbb.<br>Submatching can be implemented in both backtracking<br>and non-backtracking VMs.<br>(Code doing this dates back to 1985, but I believe this article is the first<br>written explanation of it.)
A Regular Expression Virtual Machine
To start, we'll define a regular expression virtual machine<br>(think Java VM).<br>The VM executes one or more threads, each running<br>a regular expression program, which is just a list of<br>regular expression instructions.<br>Each thread maintains two registers while it runs: a program counter (PC)<br>and a string pointer (SP).
The regular expression instructions are:
char c<br>If the character SP points at is not c, stop this thread: it failed.<br>Otherwise, advance SP to the next character<br>and advance PC to the next instruction.<br>match<br>Stop this thread: it found a match.<br>jmp x<br>Jump to (set the PC to point at) the instruction at x.<br>split x, y<br>Split execution: continue at both x and y.<br>Create a new thread with SP copied from the<br>current thread. One thread continues with PC x.<br>The other continues with PC y.<br>(Like a simultaneous jump to both locations.)
The VM starts with a single thread running with its<br>PC pointing at the beginning of the program and its SP pointing<br>at the beginning of the input string.<br>To run a thread, the VM executes the instruction<br>that the thread's PC points at; executing the instruction changes<br>the thread's PC to point at the next instruction to run. Repeat until<br>an instruction (a failed char or a match) stops the thread.<br>The regular expression matches a string if any thread finds a match.
Compiling a regular expression into byte code proceeds<br>recursively depending on the form of the regular expression.<br>Recall from the previous article<br>that regular expressions come<br>in four forms: a single letter like a,<br>a concatenation e1e2,<br>an alternation e1|e2,<br>or a repetition e? (zero or one),<br>e* (zero or more),<br>or<br>e+ (one or more).
A single letter a compiles into the single<br>instruction char a.<br>A concatenation concatenates the compiled form of the<br>two subexpressions.<br>An alternation uses a split to allow either<br>choice to succeed.<br>A zero-or-one repetition e? uses a split to<br>compile like an alternation with the empty string.<br>The zero-or-more repetition e* and the<br>one-or-more repetition e+ use a split<br>to choose whether to match e or break out of the repetition.
The exact code sequences are:
char a
e1e2<br>codes for e1
codes for e2
e1|e2<br>split L1, L2
L1: codes for e1
jmp L3
L2: codes for e2
L3:
e?<br>split L1, L2
L1: codes for e
L2:
e*<br>L1: split L2, L3
L2: codes for e
jmp L1
L3:
e+<br>L1: codes for e
split L1, L3
L3:
Once the entire regular expression has been compiled,<br>the generated code is finished with a final match<br>instruction.
As an example, the regular expression a+b+ compiles into
0 char a<br>1 split 0, 2<br>2 char b<br>3 split 2, 4<br>4 match
When run on aab, a VM implementation<br>might run the program this way:
Thread PC SP Execution
T1<br>0 char a<br>a ab<br>character matches
T1<br>1 split 0, 2<br>aa b<br>creates thread T2 at PC=2 SP=aa b
T1<br>0 char a<br>aa b<br>character matches
T1<br>1 split 0, 2<br>aa b<br>creates thread T3 at PC=2 SP=aab
T1<br>0 char a<br>aab<br>no match: thread T1 dies
T2<br>2 char b<br>aa b<br>no match: thread T2 dies
T3<br>2 char b<br>aab<br>character matches
T3<br>3 split 2, 4<br>abb<br>creates thread T4 at PC=4 SP=abb
T3<br>2 char b<br>abb<br>no match (end of string): thread T3 dies
T4<br>4 match<br>abb<br>match!
In this example, the implementation waits to run a new thread until<br>the current thread finishes,...