Regular Expression Matching: The Virtual Machine Approach (2009)

Russ Cox

rsc@swtch.com

December 2009

Introduction

Name the most widely used bytecode interpreter or virtual machine. Sun's JVM? Adobe's Flash? .NET and Mono? Perl? Python? PHP? These are all certainly popular, but there's one more widely used than all those combined. That bytecode interpreter is Henry Spencer's regular expression library and its many descendants.

The first article in this series described the two main strategies for implementing regular expression matching: the worst-case linear-time NFA- and DFA-based strategies used in awk and egrep (and now most greps), and the worst-case exponential-time backtracking strategy used almost everywhere else, including ed, sed, Perl, PCRE, and Python.

This article presents those two strategies as two different ways to implement a virtual machine that executes a regular expression that has been compiled into text-matching bytecodes, just as .NET and Mono are different ways to implement a virtual machine that executes a program that has been compiled into CLI bytecodes.

Viewing regular expression matching as executing a special machine makes it possible to add new features just by adding (and implementing!) new machine instructions. In particular, we can add regular expression submatching instructions, so that after matching (a+)(b+) against aabbbb, a program can find out that the parenthesized (a+) (often referred to as \1 or $1) matched aa and that (b+) matched bbbb. Submatching can be implemented in both backtracking and non-backtracking VMs. (Code doing this dates back to 1985, but I believe this article is the first written explanation of it.)
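To make the submatching example concrete from the caller's side, here is a small sketch using Python's standard re module. Python is used only for illustration; its matcher is one of the backtracking implementations mentioned above, not the VM developed in this article.

```python
import re

# Match (a+)(b+) against "aabbbb" and read back what each
# parenthesized group matched.
m = re.fullmatch(r"(a+)(b+)", "aabbbb")
assert m is not None

first, second = m.group(1), m.group(2)
print(first, second)  # prints: aa bbbb
```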

A Regular Expression Virtual Machine

To start, we'll define a regular expression virtual machine (think Java VM). The VM executes one or more threads, each running a regular expression program, which is just a list of regular expression instructions. Each thread maintains two registers while it runs: a program counter (PC) and a string pointer (SP).

The regular expression instructions are:

char c       If the character SP points at is not c, stop this thread: it failed. Otherwise, advance SP to the next character and advance PC to the next instruction.
match        Stop this thread: it found a match.
jmp x        Jump to (set the PC to point at) the instruction at x.
split x, y   Split execution: continue at both x and y. Create a new thread with SP copied from the current thread. One thread continues with PC x. The other continues with PC y. (Like a simultaneous jump to both locations.)

The VM starts with a single thread running with its PC pointing at the beginning of the program and its SP pointing at the beginning of the input string. To run a thread, the VM executes the instruction that the thread's PC points at; executing the instruction changes the thread's PC to point at the next instruction to run. Repeat until an instruction (a failed char or a match) stops the thread. The regular expression matches a string if any thread finds a match.
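The four instructions and the run loop just described can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the tuple encoding of instructions, the name run, and the choice to keep waiting threads on a plain list are all assumptions made for illustration. As in the instruction descriptions, match succeeds wherever SP happens to be, so the sketch reports a match on a prefix of the input. It also has no guard against programs that loop without consuming input.

```python
def run(prog, text):
    """Return True if any thread reaches a match instruction.

    prog is a list of tuples (hypothetical encoding):
    ("char", c), ("match",), ("jmp", x), ("split", x, y).
    """
    threads = [(0, 0)]  # each thread is a (pc, sp) pair
    while threads:
        pc, sp = threads.pop()
        while True:
            op = prog[pc]
            if op[0] == "char":
                if sp >= len(text) or text[sp] != op[1]:
                    break  # this thread failed; pick another one
                pc, sp = pc + 1, sp + 1
            elif op[0] == "match":
                return True
            elif op[0] == "jmp":
                pc = op[1]
            elif op[0] == "split":
                threads.append((op[2], sp))  # new thread starts at y
                pc = op[1]                   # this thread continues at x
            else:
                raise ValueError(f"unknown instruction: {op!r}")
    return False  # every thread failed


# Hand-assembled program: optional 'a' followed by 'b'.
PROG = [("split", 1, 2), ("char", "a"), ("char", "b"), ("match",)]
print(run(PROG, "ab"), run(PROG, "b"), run(PROG, "a"))  # prints: True True False
```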

Compiling a regular expression into bytecode proceeds recursively depending on the form of the regular expression. Recall from the previous article that regular expressions come in four forms: a single letter like a, a concatenation e1e2, an alternation e1|e2, or a repetition e? (zero or one), e* (zero or more), or e+ (one or more).

A single letter a compiles into the single instruction char a. A concatenation concatenates the compiled forms of the two subexpressions. An alternation uses a split to allow either choice to succeed. A zero-or-one repetition e? uses a split to compile like an alternation with the empty string. The zero-or-more repetition e* and the one-or-more repetition e+ use a split to choose whether to match e or break out of the repetition.

The exact code sequences are:

a           char a

e1e2        codes for e1
            codes for e2

e1|e2       split L1, L2
        L1: codes for e1
            jmp L3
        L2: codes for e2
        L3:

e?          split L1, L2
        L1: codes for e
        L2:

e*      L1: split L2, L3
        L2: codes for e
            jmp L1
        L3:

e+      L1: codes for e
            split L1, L3
        L3:

Once the entire regular expression has been compiled, the generated code is finished with a final match instruction.

As an example, the regular expression a+b+ compiles into

0   char a
1   split 0, 2
2   char b
3   split 2, 4
4   match
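The compilation rules above translate almost directly into a recursive function. The following is a sketch under stated assumptions: the input is an already-parsed tree of nested tuples, and the node names "char", "cat", "alt", "quest", "star", and "plus", the function name compile_regex, and the tuple encoding of instructions are all hypothetical, chosen for illustration. Labels become absolute instruction indices, and split and jmp targets that are not yet known are filled in after the subexpression has been emitted.

```python
def compile_regex(ast):
    """Compile a nested-tuple regex tree into a list of instruction tuples."""
    prog = []

    def emit(node):
        kind = node[0]
        if kind == "char":
            prog.append(["char", node[1]])
        elif kind == "cat":
            emit(node[1])                      # codes for e1
            emit(node[2])                      # codes for e2
        elif kind == "alt":
            split = len(prog)
            prog.append(None)                  # split L1, L2 (patched below)
            emit(node[1])                      # L1: codes for e1
            jmp = len(prog)
            prog.append(None)                  # jmp L3 (patched below)
            l2 = len(prog)
            emit(node[2])                      # L2: codes for e2
            prog[split] = ["split", split + 1, l2]
            prog[jmp] = ["jmp", len(prog)]     # L3 is the next free slot
        elif kind == "quest":
            split = len(prog)
            prog.append(None)                  # split L1, L2 (patched below)
            emit(node[1])                      # L1: codes for e
            prog[split] = ["split", split + 1, len(prog)]
        elif kind == "star":
            l1 = len(prog)
            prog.append(None)                  # L1: split L2, L3 (patched below)
            emit(node[1])                      # L2: codes for e
            prog.append(["jmp", l1])
            prog[l1] = ["split", l1 + 1, len(prog)]
        elif kind == "plus":
            l1 = len(prog)
            emit(node[1])                      # L1: codes for e
            prog.append(["split", l1, len(prog) + 1])
        else:
            raise ValueError(f"unknown node kind: {kind!r}")

    emit(ast)
    prog.append(["match"])                     # final match instruction
    return [tuple(ins) for ins in prog]


# a+b+ as a tree: concatenation of two one-or-more repetitions.
A_PLUS_B_PLUS = ("cat", ("plus", ("char", "a")), ("plus", ("char", "b")))
for pc, ins in enumerate(compile_regex(A_PLUS_B_PLUS)):
    print(pc, *ins)
```

For a+b+ the sketch produces char a, split 0 2, char b, split 2 4, match at addresses 0 through 4, which agrees with the listing for that expression.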

When run on aab, a VM implementation might run the program this way:

(In the SP column, brackets mark the character SP points at; aab[] means SP is at the end of the string.)

Thread  PC            SP      Execution
T1      0 char a      [a]ab   character matches
T1      1 split 0, 2  a[a]b   creates thread T2 at PC=2 SP=a[a]b
T1      0 char a      a[a]b   character matches
T1      1 split 0, 2  aa[b]   creates thread T3 at PC=2 SP=aa[b]
T1      0 char a      aa[b]   no match: thread T1 dies
T2      2 char b      a[a]b   no match: thread T2 dies
T3      2 char b      aa[b]   character matches
T3      3 split 2, 4  aab[]   creates thread T4 at PC=4 SP=aab[]
T3      2 char b      aab[]   no match (end of string): thread T3 dies
T4      4 match       aab[]   match!
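The run above can be reproduced mechanically. The sketch below is one possible scheduler, not the only one: it runs each thread until it dies and starts waiting threads in the order they were created. The name trace, the tuple encoding of instructions, and the log format are assumptions made for illustration; SP is recorded as an index into the input rather than as a marked-up string.

```python
from collections import deque

# The a+b+ program, encoded as tuples (hypothetical encoding).
PROG = [("char", "a"), ("split", 0, 2), ("char", "b"), ("split", 2, 4), ("match",)]


def trace(prog, text):
    """Run threads one at a time in creation order.

    Returns a list of (thread id, pc, sp, event) tuples, one per
    executed instruction, stopping at the first match.
    """
    log = []
    queue = deque([(1, 0, 0)])  # (thread id, pc, sp); T1 starts at 0, 0
    next_id = 2
    while queue:
        tid, pc, sp = queue.popleft()
        while True:
            op = prog[pc]
            if op[0] == "char":
                if sp < len(text) and text[sp] == op[1]:
                    log.append((tid, pc, sp, "character matches"))
                    pc, sp = pc + 1, sp + 1
                else:
                    log.append((tid, pc, sp, "no match: thread dies"))
                    break
            elif op[0] == "split":
                log.append((tid, pc, sp, f"creates thread T{next_id} at PC={op[2]}"))
                queue.append((next_id, op[2], sp))  # new thread waits its turn
                next_id += 1
                pc = op[1]                          # current thread keeps going
            elif op[0] == "jmp":
                pc = op[1]
            elif op[0] == "match":
                log.append((tid, pc, sp, "match!"))
                return log
    return log


for tid, pc, sp, event in trace(PROG, "aab"):
    print(f"T{tid}  PC={pc}  SP={sp}  {event}")
```

Run on aab, the log has ten entries in the same thread order as the table: T1 five times, T2 once, T3 three times, and finally T4 with the match.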

In this example, the implementation waits to run a new thread until the current thread finishes,...
