Regular Expression Matching Can Be Simple and Fast (2007)

poly2it1 pts0 comments

Regular Expression Matching Can Be Simple And Fast

Regular Expression Matching Can Be Simple And Fast

(but is slow in Java, Perl, PHP, Python, Ruby, ...)

Russ Cox

rsc@swtch.com

January 2007

Introduction

This is a tale of two approaches to regular expression matching.<br>One of them is in widespread use in the<br>standard interpreters for many languages, including Perl.<br>The other is used only in a few places, notably most implementations<br>of awk and grep.<br>The two approaches have wildly different<br>performance characteristics:

Time to match a?nan against an

Let's use superscripts to denote string repetition,<br>so that<br>a?3a3<br>is shorthand for<br>a?a?a?aaa.<br>The two graphs plot the time required by each approach<br>to match the regular expression<br>a?nan<br>against the string an.

Notice that Perl requires over sixty seconds to match<br>a 29-character string.<br>The other approach, labeled Thompson NFA for<br>reasons that will be explained later,<br>requires twenty microseconds to match the string.<br>That's not a typo. The Perl graph plots time in seconds,<br>while the Thompson NFA graph plots time in microseconds:<br>the Thompson NFA implementation<br>is a million times faster than Perl<br>when running on a miniscule 29-character string.<br>The trends shown in the graph continue: the<br>Thompson NFA handles a 100-character string in under 200 microseconds,<br>while Perl would require over 1015 years.<br>(Perl is only the most conspicuous example of a large<br>number of popular programs that use the same algorithm;<br>the above graph could have been Python, or PHP, or Ruby,<br>or many other languages. A more detailed<br>graph later in this article presents data for other implementations.)

It may be hard to believe the graphs: perhaps you've used Perl,<br>and it never seemed like regular expression matching was<br>particularly slow.<br>Most of the time, in fact, regular expression matching in Perl<br>is fast enough.<br>As the graph shows, though, it is possible<br>to write so-called &ldquo;pathological&rdquo; regular expressions that<br>Perl matches very very slowly.<br>In contrast, there are no regular expressions that are<br>pathological for the Thompson NFA implementation.<br>Seeing the two graphs side by side prompts the question,<br>&ldquo;why doesn't Perl use the Thompson NFA approach?&rdquo;<br>It can, it should, and that's what the rest of this article is about.

Historically, regular expressions are one of computer science's<br>shining examples of how using good theory leads to good programs.<br>They were originally developed by theorists as a<br>simple computational model,<br>but Ken Thompson introduced them to<br>programmers in his implementation of the text editor QED<br>for CTSS.<br>Dennis Ritchie followed suit in his own implementation<br>of QED, for GE-TSS.<br>Thompson and Ritchie would go on to create Unix,<br>and they brought regular expressions with them.<br>By the late 1970s, regular expressions were a key<br>feature of the Unix landscape, in tools such as<br>ed, sed, grep, egrep, awk, and lex.

Today, regular expressions have also become a shining<br>example of how ignoring good theory leads to bad programs.<br>The regular expression implementations used by<br>today's popular tools are significantly slower<br>than the ones used in many of those thirty-year-old Unix tools.

This article reviews the good theory:<br>regular expressions, finite automata,<br>and a regular expression search algorithm<br>invented by Ken Thompson in the mid-1960s.<br>It also puts the theory into practice, describing<br>a simple implementation of Thompson's algorithm.<br>That implementation, less than 400 lines of C,<br>is the one that went head to head with Perl above.<br>It outperforms the more complex real-world<br>implementations used by<br>Perl, Python, PCRE, and others.<br>The article concludes with a discussion of how<br>theory might yet be converted into practice<br>in the real-world implementations.

Regular Expressions

Regular expressions are a notation for<br>describing sets of character strings.<br>When a particular string is in the set<br>described by a regular expression,<br>we often say that the regular expression<br>matches<br>the string.

The simplest regular expression is a single literal character.<br>Except for the special metacharacters<br>*+?()|,<br>characters match themselves.<br>To match a metacharacter, escape it with<br>a backslash:<br>\+<br>matches a literal plus character.

Two regular expressions can be alternated or concatenated to form a new<br>regular expression:<br>if e1 matches<br>and e2 matches<br>t,<br>then e1|e2 matches<br>or<br>t,<br>and<br>e1e2<br>matches<br>st.

The metacharacters<br>*,<br>+,<br>and<br>are repetition operators:<br>e1*<br>matches a sequence of zero or more (possibly different)<br>strings, each of which match e1;<br>e1+<br>matches one or more;<br>e1?<br>matches zero or one.

The operator precedence, from weakest to strongest binding, is<br>first alternation, then concatenation, and finally the<br>repetition operators.<br>Explicit parentheses can be used to force different meanings,<br>just as in arithmetic expressions.<br>Some examples:<br>ab|cd<br>is equivalent to<br>(ab)|(cd);<br>ab*<br>is equivalent to<br>a(b*).

The syntax described so far is a subset of the traditional...

regular expression perl expressions thompson matches

Related Articles