Regex engine internals as a library (2023)

tosh1 pts0 comments

Regex engine internals as a library - Andrew Gallant's Blog

Regex engine internals as a library

Jul 5, 2023

Over the last several years, I&rsquo;ve rewritten Rust&rsquo;s regex<br>crate to enable better internal composition, and to make it<br>easier to add optimizations while maintaining correctness. In the course of<br>this rewrite I created a new crate, regex-automata, which exposes much<br>of the regex crate internals as their own APIs for others to use. To my<br>knowledge, this is the first regex library to expose its internals to the<br>degree done in regex-automata as a separately versioned library.

This blog post discusses the problems that led to the rewrite, how the rewrite<br>solved them and a guided tour of regex-automata&rsquo;s API.

Target audience : Rust programmers and anyone with an interest in how one<br>particular finite automata regex engine is implemented. Prior experience with<br>regular expressions is assumed.

Table of Contents

Brief history

The problems

Problem: composition was difficult

Problem: testing was difficult

Problem: requests for niche APIs

Problem: fully compiled DFAs

Follow along with regex-cli

Flow of data

Literal optimizations

Motivating literal optimizations

Literal extraction

Searching for literals

The NFA data type

A simple NFA example

NFA optimization: sparse states

NFA optimization: minimal UTF-8 automata

NFA optimization: literal trie

NFA future work

Regex engines

Common elements among regex engines

Engine: Pike VM

Engine: bounded backtracker

Engine: one-pass DFA

Engine: DFA

Engine: hybrid NFA/DFA

The meta regex engine

Differences with RE2

Testing strategy

Benchmarking

Costs

Wrap up

Brief history

In September 2012, an issue was filed on the Rust<br>repository requesting that a regex library be added to the<br>Rust Distribution. Graydon Hoare later commented in that thread<br>that they preferred RE2. For those that don&rsquo;t know, RE2 is a regex engine<br>that uses finite automata to guarantee O(m * n) worst case search time<br>while providing a Perl-like syntax that excludes features that are not known<br>how to implement efficiently. RE2&rsquo;s design is described by its author, Russ<br>Cox, in a series of articles on implementing a regex engine using finite<br>automata.

In April 2014, I showed up and said I was working on a regex engine inspired<br>by RE2. I treated Cox&rsquo;s articles as a blueprint for how to<br>build a regex library. Soon there after, I published an RFC to add a regex<br>library to the &ldquo;Rust Distribution.&rdquo; This was before Rust 1.0<br>and Cargo (the second version, not the first), and the &ldquo;Rust<br>Distribution&rdquo; referred to rustc, std and several &ldquo;supporting&rdquo; libraries<br>that were all bundled together. This RFC proposed adding a regex crate to<br>that list of supporting libraries.

Ten days later, the RFC was approved. The next<br>day, I submitted a pull request to rust-lang/rust,<br>adding it to the Rust distribution. Things moved fast back then. Notice also<br>that I had originally called the crate regexp. The PR to Rust involved a<br>discussion about naming that eventually resulted in it being called regex<br>instead.

Two years later in May 2016, I wrote an RFC to release regex 1.0. That took a few months to be approved, but it wasn&rsquo;t<br>until a couple years later in May 2018 that I actually released regex 1.0.

Before regex 1.0 was released, I had been steadily working on a complete<br>re-imagining of the crate internals. From a commit message in March<br>2018:

The [regex-syntax] rewrite is intended to be the first phase in an effort to<br>overhaul the entire regex crate.

I didn&rsquo;t know exactly where I was going at that point in time, but in<br>March 2020, I started work in earnest on rewriting the actual matching<br>engines. A little more than three years later, regex 1.9 has been<br>released with the completed rewrite.

The problems

What kinds of problems were facing the regex crate that warranted a full<br>rewrite? And moreover, why publish the rewritten internals as its own crate?

There are a host of things to discuss here.

Problem: composition was difficult

Following in the tradition of RE2, the regex crate contains a<br>number of different strategies that it can use to implement a search. Sometimes<br>multiple strategies are used in a single search call.

There are generally two dimensions, often at odds with one another, to each<br>strategy: performance and functionality. Faster strategies tend to be more<br>limited in functionality. For example, a fast strategy might be able to report<br>the start and end of a match but not the offsets for each capture group in the<br>regex. Conversely, a slower strategy might be needed to report the offsets of<br>each capture group.

When I originally wrote the regex crate, I implemented a single strategy<br>(the PikeVM) and didn&rsquo;t do any thoughtful design work for how to incorporate<br>alternative strategies. Eventually, new strategies were added organically:

A BoundedBacktracker that can report capture group offsets like...

regex engine rust crate rsquo library

Related Articles