Question: Is matching fixed regexes with Back-references in P? – Branch Free
Skip to content
There is a persistent meme out there that matching regular expressions with back-references is NP-Hard. There is a post about this and the claim is repeated by Russ Cox so this is now part of received wisdom.
None of these claims are false; they just don’t apply to regular expression matching in the sense that most people would imagine (any more than, say, someone would claim, "colloquially" that summing a list of N integers is O(N^2) since it’s quite possible that each integer might be N bits long). It depends on the generally unfamiliar notion that the regular expression being matched might be arbitrarily varied to add more back-references.
These constructions rely on being able to add more things to the regular expression as the size of the problem that’s being reduced to ‘regex matching with back-references’ gets bigger.
Suppose, instead, as per more common practice, we are considering the difficulty of matching a fixed regular expressions with one or more back-references against an input of size N.
Is this task is in P? That is, is there a polynomial-time algorithm in the size of the input that will tell us whether this back-reference containing regular expression matched?
Note that back-references in a regular expression don’t "lock" – so the pattern /((\wx)\2)z/ will match "axaxbxbxz" (EDIT: sorry, I originally fat-fingered this example). So, sadly, we can’t just enumerate all starts and ending positions of every back-reference (say there are k backreferences) for a bad but polynomial-time algorithm (this would be O(N^2k) runs of our algorithm without back-references, so if we had a O(N) algorithm we could solve it in O(N^(2k+1)). Unfortunately, this construction doesn’t work – the capturing parentheses to which the back-references occur update, and so there can be numerous instances of them.
Note that even a lousy algorithm for establishing that this is possible suffices. So if there’s a construction that shows that we can match regular expressions with k backreferences in O(N^(100k^2+10000)) we’d still be in P, even if the algorithm is rubbish. I’ve read that (I forget the source) that, informally, a lousy poly-time algorithm can often be improved, but an exponential-time algorithm is intractable. So knowing that this problem was in P would be helpful.
So I’m curious – are there any either (a) results showing that fixed regex matching with back-references is also NP-hard, or (b) results, possibly the construction of a dreadfully naive algorithm, showing that it can be polynomial?
Share this:<br>Tweet
Share on Reddit (Opens in new window)<br>Reddit
Like Loading...
Related
Published by geofflangdale
I work at Intel on the Hyperscan project: https://github.com/01org/hyperscan
Worked on JSON parsing with Daniel Lemire at: https://github.com/lemire/simdjson
Blog: branchfree.org
Currently between jobs.<br>View all posts by geofflangdale
3 thoughts on “Question: Is matching fixed regexes with Back-references in P?”
I think matching regex with backreferences, with a fixed number of captured groups k, is in P.
Here’s an implementation which I think achieves that:
https://github.com/travisdowns/polyregex
The basic idea is the same as the proof sketch on Twitter:
Here's a sketch of a proof (second try) that matching with backreferences is in P.
1/x
— Travis Downs (@trav_downs) April 7, 2019
The bound I found is O(n^(2k+2)) time and O(n^(2k+1)) space, which is very slightly different than the bound in the Twitter thread (because of the way actual backreference instances are expanded).
This isn’t meant to be a useful regex matcher, just a proof of concept! Even apart from being totally unoptimized, an O(n^20) algorithm (with 9 backrefs), might as well be exponential for most inputs.
Still, it may be the first matcher that doesn’t explode exponentially and yet supports backreferences.
LikeLike
I am not satisfied with the idea that there are n^(2k) start/stop pairs in the input for k backreferences. If a capturing subexpression and the corresponding backref appear inside a loop it will take on multiple different values – potentially O(n) different values.
LikeLike
I probably should have been more precise with my language: at any one time (while handing a given character in the input), for a single state (aka "path"), there is a single start/stop position (including the possibility of "not captured") for each capturing group. As you move on to later characters, that can definitely change – so the start/stop pair for each backreference can change up to n times for an n-length string. That’s fine though, and in fact it doesn’t even end up changing the order.
They key is that capturing groups have no "memory" – when a group gets captured for the second time, what got captured the first time doesn’t matter any more, later behavior only depends on the last match. That prevents the...