The true power of regular expressions (2012)

The true power of regular expressions

As someone who frequents the PHP tag on StackOverflow I pretty often see questions about how to parse some particular aspect of HTML using regular expressions. A common reply to such a question is:

You cannot parse HTML with regular expressions, because HTML isn’t regular. Use an XML parser instead.

This statement - in the context of the question - is somewhere between very misleading and outright wrong. What I’ll try to demonstrate in this article is how powerful modern regular expressions really are.

What does “regular” actually mean?

In the context of formal language theory, something is called “regular” when it has a grammar where all production rules have one of the following forms:

B -> a B -> aC B -> ε

You can read those -> rules as “The left hand side can be replaced with the right hand side”. So the first rule would be “B can be replaced with a”, the second one “B can be replaced with aC” and the third one “B can be replaced with the empty string” (ε is the symbol for the empty string).

So what are B, C and a? By convention, uppercase characters denote so called “non-terminals” - symbols which can be broken down further - and lowercase characters denote “terminals” - symbols which cannot be broken down any further.

All that probably sounds a bit abstract, so let’s look at an example: Defining the natural numbers as a grammar.

N -> 0 N -> 1 N -> 2 N -> 3 N -> 4 N -> 5 N -> 6 N -> 7 N -> 8 N -> 9 N -> 0N N -> 1N N -> 2N N -> 3N N -> 4N N -> 5N N -> 6N N -> 7N N -> 8N N -> 9N

What this grammar says is:

A natural number (N) is ... one of the digits 0 to 9 or ... one of the digits 0 to 9 followed by another natural number (N)

In this example the digits 0 to 9 would be terminals (as they can’t be broken down any further) and N would be the only non-terminal (as it can be and is broken down further).

If you have another look at the rules and compare them to the definition of a regular grammar from above, you’ll see that they meet the criteria: The first ten rules are of the form B -> a and the second ten rules follow the form B -> aC. Thus the grammar defining the natural numbers is regular.

Another thing you might notice is that even though the above grammar defines such a simple thing, it is already quite bloated. Wouldn’t it be better if we could express the same concept in a more concise manner?

And that’s where regular expressions come in: The above grammar is equivalent to the regex [0-9]+ (which is a hell lot simpler). And this kind of transformation can be done with any regular grammar: Every regular grammar has a corresponding regular expression which defines all its valid strings.

What can regular expressions match?

Thus the question arises: Can regular expressions match only regular grammars, or can they also match more? The answer to this is both yes and no:

Regular expressions in the formal grammar sense can (pretty much by definition) only parse regular grammars and nothing more.

But when programmers talk about “regular expressions” they aren’t talking about formal grammars. They are talking about the regular expression derivative which their language implements. And those regex implementations are only very slightly related to the original notion of regularity.

Any modern regex flavor can match a lot more than just regular languages. How much exactly, that’s what the rest of the article is about.

To keep things simple, I’ll focus on the PCRE regex implementation in the following, simply because I know it best (as it’s used by PHP). Most other regex implementations are quite similar though, so most stuff should apply to them too.

The language hierarchy

In order to analyze what regular expressions can and cannot match, we first have to look at what other types of languages there are. A good starting point for this is the Chomsky hierarchy:

Chomsky hierarchy:

/-------------------------------------------\ | | | Recursively enumerable languages | Type 0 | | | /-----------------------------------\ | | | | | | | Context-sensitive languages | | Type 1 | | | | | | /---------------------------\ | | | | | | | | | | | Context-free languages | | | Type 2 | | | | | | | | | /-------------------\ | | | | | | | Regular languages | | | | Type 3 | | | \-------------------/ | | | | | \---------------------------/ | | | \-----------------------------------/ | \-------------------------------------------/

As you can see the Chomsky hierarchy divides formal languages into four types:

Regular languages (Type 3) are the least-powerful, followed by the context-free languages (Type 2), the context-sensitive languages (Type 1) and at last the all-mighty recursively enumerable languages (Type 0).

The Chomsky hierarchy is a containment hierarchy, so the smaller boxes in the above image are fully contained in the larger boxes. For example every regular language is also a context-free language (but not the other way around!)

So, let’s move one step up...

The true power of regular expressions (2012)

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy

SpaceX not the behemoth everyone thought

Naphtha Shortages Having a Growing Impact in Japan