Regex engine internals as a library - Andrew Gallant's Blog
Regex engine internals as a library
Jul 5, 2023
Over the last several years, I’ve rewritten Rust’s regex<br>crate to enable better internal composition, and to make it<br>easier to add optimizations while maintaining correctness. In the course of<br>this rewrite I created a new crate, regex-automata, which exposes much<br>of the regex crate internals as their own APIs for others to use. To my<br>knowledge, this is the first regex library to expose its internals to the<br>degree done in regex-automata as a separately versioned library.
This blog post discusses the problems that led to the rewrite, how the rewrite<br>solved them and a guided tour of regex-automata’s API.
Target audience : Rust programmers and anyone with an interest in how one<br>particular finite automata regex engine is implemented. Prior experience with<br>regular expressions is assumed.
Table of Contents
Brief history
The problems
Problem: composition was difficult
Problem: testing was difficult
Problem: requests for niche APIs
Problem: fully compiled DFAs
Follow along with regex-cli
Flow of data
Literal optimizations
Motivating literal optimizations
Literal extraction
Searching for literals
The NFA data type
A simple NFA example
NFA optimization: sparse states
NFA optimization: minimal UTF-8 automata
NFA optimization: literal trie
NFA future work
Regex engines
Common elements among regex engines
Engine: Pike VM
Engine: bounded backtracker
Engine: one-pass DFA
Engine: DFA
Engine: hybrid NFA/DFA
The meta regex engine
Differences with RE2
Testing strategy
Benchmarking
Costs
Wrap up
Brief history
In September 2012, an issue was filed on the Rust<br>repository requesting that a regex library be added to the<br>Rust Distribution. Graydon Hoare later commented in that thread<br>that they preferred RE2. For those that don’t know, RE2 is a regex engine<br>that uses finite automata to guarantee O(m * n) worst case search time<br>while providing a Perl-like syntax that excludes features that are not known<br>how to implement efficiently. RE2’s design is described by its author, Russ<br>Cox, in a series of articles on implementing a regex engine using finite<br>automata.
In April 2014, I showed up and said I was working on a regex engine inspired<br>by RE2. I treated Cox’s articles as a blueprint for how to<br>build a regex library. Soon there after, I published an RFC to add a regex<br>library to the “Rust Distribution.” This was before Rust 1.0<br>and Cargo (the second version, not the first), and the “Rust<br>Distribution” referred to rustc, std and several “supporting” libraries<br>that were all bundled together. This RFC proposed adding a regex crate to<br>that list of supporting libraries.
Ten days later, the RFC was approved. The next<br>day, I submitted a pull request to rust-lang/rust,<br>adding it to the Rust distribution. Things moved fast back then. Notice also<br>that I had originally called the crate regexp. The PR to Rust involved a<br>discussion about naming that eventually resulted in it being called regex<br>instead.
Two years later in May 2016, I wrote an RFC to release regex 1.0. That took a few months to be approved, but it wasn’t<br>until a couple years later in May 2018 that I actually released regex 1.0.
Before regex 1.0 was released, I had been steadily working on a complete<br>re-imagining of the crate internals. From a commit message in March<br>2018:
The [regex-syntax] rewrite is intended to be the first phase in an effort to<br>overhaul the entire regex crate.
I didn’t know exactly where I was going at that point in time, but in<br>March 2020, I started work in earnest on rewriting the actual matching<br>engines. A little more than three years later, regex 1.9 has been<br>released with the completed rewrite.
The problems
What kinds of problems were facing the regex crate that warranted a full<br>rewrite? And moreover, why publish the rewritten internals as its own crate?
There are a host of things to discuss here.
Problem: composition was difficult
Following in the tradition of RE2, the regex crate contains a<br>number of different strategies that it can use to implement a search. Sometimes<br>multiple strategies are used in a single search call.
There are generally two dimensions, often at odds with one another, to each<br>strategy: performance and functionality. Faster strategies tend to be more<br>limited in functionality. For example, a fast strategy might be able to report<br>the start and end of a match but not the offsets for each capture group in the<br>regex. Conversely, a slower strategy might be needed to report the offsets of<br>each capture group.
When I originally wrote the regex crate, I implemented a single strategy<br>(the PikeVM) and didn’t do any thoughtful design work for how to incorporate<br>alternative strategies. Eventually, new strategies were added organically:
A BoundedBacktracker that can report capture group offsets like...