Regular Expression Matching Can Be Simple And Fast
Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)
Russ Cox
rsc@swtch.com
January 2007
Introduction
This is a tale of two approaches to regular expression matching.<br>One of them is in widespread use in the<br>standard interpreters for many languages, including Perl.<br>The other is used only in a few places, notably most implementations<br>of awk and grep.<br>The two approaches have wildly different<br>performance characteristics:
Time to match a?nan against an
Let's use superscripts to denote string repetition,<br>so that<br>a?3a3<br>is shorthand for<br>a?a?a?aaa.<br>The two graphs plot the time required by each approach<br>to match the regular expression<br>a?nan<br>against the string an.
Notice that Perl requires over sixty seconds to match<br>a 29-character string.<br>The other approach, labeled Thompson NFA for<br>reasons that will be explained later,<br>requires twenty microseconds to match the string.<br>That's not a typo. The Perl graph plots time in seconds,<br>while the Thompson NFA graph plots time in microseconds:<br>the Thompson NFA implementation<br>is a million times faster than Perl<br>when running on a miniscule 29-character string.<br>The trends shown in the graph continue: the<br>Thompson NFA handles a 100-character string in under 200 microseconds,<br>while Perl would require over 1015 years.<br>(Perl is only the most conspicuous example of a large<br>number of popular programs that use the same algorithm;<br>the above graph could have been Python, or PHP, or Ruby,<br>or many other languages. A more detailed<br>graph later in this article presents data for other implementations.)
It may be hard to believe the graphs: perhaps you've used Perl,<br>and it never seemed like regular expression matching was<br>particularly slow.<br>Most of the time, in fact, regular expression matching in Perl<br>is fast enough.<br>As the graph shows, though, it is possible<br>to write so-called “pathological” regular expressions that<br>Perl matches very very slowly.<br>In contrast, there are no regular expressions that are<br>pathological for the Thompson NFA implementation.<br>Seeing the two graphs side by side prompts the question,<br>“why doesn't Perl use the Thompson NFA approach?”<br>It can, it should, and that's what the rest of this article is about.
Historically, regular expressions are one of computer science's<br>shining examples of how using good theory leads to good programs.<br>They were originally developed by theorists as a<br>simple computational model,<br>but Ken Thompson introduced them to<br>programmers in his implementation of the text editor QED<br>for CTSS.<br>Dennis Ritchie followed suit in his own implementation<br>of QED, for GE-TSS.<br>Thompson and Ritchie would go on to create Unix,<br>and they brought regular expressions with them.<br>By the late 1970s, regular expressions were a key<br>feature of the Unix landscape, in tools such as<br>ed, sed, grep, egrep, awk, and lex.
Today, regular expressions have also become a shining<br>example of how ignoring good theory leads to bad programs.<br>The regular expression implementations used by<br>today's popular tools are significantly slower<br>than the ones used in many of those thirty-year-old Unix tools.
This article reviews the good theory:<br>regular expressions, finite automata,<br>and a regular expression search algorithm<br>invented by Ken Thompson in the mid-1960s.<br>It also puts the theory into practice, describing<br>a simple implementation of Thompson's algorithm.<br>That implementation, less than 400 lines of C,<br>is the one that went head to head with Perl above.<br>It outperforms the more complex real-world<br>implementations used by<br>Perl, Python, PCRE, and others.<br>The article concludes with a discussion of how<br>theory might yet be converted into practice<br>in the real-world implementations.
Regular Expressions
Regular expressions are a notation for<br>describing sets of character strings.<br>When a particular string is in the set<br>described by a regular expression,<br>we often say that the regular expression<br>matches<br>the string.
The simplest regular expression is a single literal character.<br>Except for the special metacharacters<br>*+?()|,<br>characters match themselves.<br>To match a metacharacter, escape it with<br>a backslash:<br>\+<br>matches a literal plus character.
Two regular expressions can be alternated or concatenated to form a new<br>regular expression:<br>if e1 matches<br>and e2 matches<br>t,<br>then e1|e2 matches<br>or<br>t,<br>and<br>e1e2<br>matches<br>st.
The metacharacters<br>*,<br>+,<br>and<br>are repetition operators:<br>e1*<br>matches a sequence of zero or more (possibly different)<br>strings, each of which match e1;<br>e1+<br>matches one or more;<br>e1?<br>matches zero or one.
The operator precedence, from weakest to strongest binding, is<br>first alternation, then concatenation, and finally the<br>repetition operators.<br>Explicit parentheses can be used to force different meanings,<br>just as in arithmetic expressions.<br>Some examples:<br>ab|cd<br>is equivalent to<br>(ab)|(cd);<br>ab*<br>is equivalent to<br>a(b*).
The syntax described so far is a subset of the traditional...