Question: Is matching fixed regexes with back-references in P?

Question: Is matching fixed regexes with Back-references in P? – Branch Free

There is a persistent meme out there that matching regular expressions with back-references is NP-Hard. There is a post about this and the claim is repeated by Russ Cox so this is now part of received wisdom.

None of these claims are false; they just don’t apply to regular expression matching in the sense that most people would imagine (any more than, say, someone would claim, "colloquially" that summing a list of N integers is O(N^2) since it’s quite possible that each integer might be N bits long). It depends on the generally unfamiliar notion that the regular expression being matched might be arbitrarily varied to add more back-references.

These constructions rely on being able to add more things to the regular expression as the size of the problem that’s being reduced to ‘regex matching with back-references’ gets bigger.

Suppose, instead, as per more common practice, we are considering the difficulty of matching a fixed regular expressions with one or more back-references against an input of size N.

Is this task is in P? That is, is there a polynomial-time algorithm in the size of the input that will tell us whether this back-reference containing regular expression matched?

Note that back-references in a regular expression don’t "lock" – so the pattern /((\wx)\2)z/ will match "axaxbxbxz" (EDIT: sorry, I originally fat-fingered this example). So, sadly, we can’t just enumerate all starts and ending positions of every back-reference (say there are k backreferences) for a bad but polynomial-time algorithm (this would be O(N^2k) runs of our algorithm without back-references, so if we had a O(N) algorithm we could solve it in O(N^(2k+1)). Unfortunately, this construction doesn’t work – the capturing parentheses to which the back-references occur update, and so there can be numerous instances of them.

Note that even a lousy algorithm for establishing that this is possible suffices. So if there’s a construction that shows that we can match regular expressions with k backreferences in O(N^(100k^2+10000)) we’d still be in P, even if the algorithm is rubbish. I’ve read that (I forget the source) that, informally, a lousy poly-time algorithm can often be improved, but an exponential-time algorithm is intractable. So knowing that this problem was in P would be helpful.

So I’m curious – are there any either (a) results showing that fixed regex matching with back-references is also NP-hard, or (b) results, possibly the construction of a dreadfully naive algorithm, showing that it can be polynomial?

Share this:<br>Tweet

Share on Reddit (Opens in new window)<br>Reddit

Like Loading...

Published by geofflangdale

I work at Intel on the Hyperscan project: https://github.com/01org/hyperscan

Worked on JSON parsing with Daniel Lemire at: https://github.com/lemire/simdjson

Blog: branchfree.org

Currently between jobs.<br>View all posts by geofflangdale

3 thoughts on “Question: Is matching fixed regexes with Back-references in P?”

I think matching regex with backreferences, with a fixed number of captured groups k, is in P.

Here’s an implementation which I think achieves that:

https://github.com/travisdowns/polyregex

The basic idea is the same as the proof sketch on Twitter:

Here's a sketch of a proof (second try) that matching with backreferences is in P.

1/x

— Travis Downs (@trav_downs) April 7, 2019

The bound I found is O(n^(2k+2)) time and O(n^(2k+1)) space, which is very slightly different than the bound in the Twitter thread (because of the way actual backreference instances are expanded).

This isn’t meant to be a useful regex matcher, just a proof of concept! Even apart from being totally unoptimized, an O(n^20) algorithm (with 9 backrefs), might as well be exponential for most inputs.

Still, it may be the first matcher that doesn’t explode exponentially and yet supports backreferences.

LikeLike

I am not satisfied with the idea that there are n^(2k) start/stop pairs in the input for k backreferences. If a capturing subexpression and the corresponding backref appear inside a loop it will take on multiple different values – potentially O(n) different values.

LikeLike

I probably should have been more precise with my language: at any one time (while handing a given character in the input), for a single state (aka "path"), there is a single start/stop position (including the possibility of "not captured") for each capturing group. As you move on to later characters, that can definitely change – so the start/stop pair for each backreference can change up to n times for an n-length string. That’s fine though, and in fact it doesn’t even end up changing the order.

They key is that capturing groups have no "memory" – when a group gets captured for the second time, what got captured the first time doesn’t matter any more, later behavior only depends on the last match. That prevents the...

Question: Is matching fixed regexes with back-references in P?

Related Articles

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

How to Earn a Billion Dollars

Italy's Meloni says Trump 'made up' story that she 'begged' him for photo at G7