Regex engine internals as a library (2023)

Regex engine internals as a library - Andrew Gallant's Blog

Jul 5, 2023

Over the last several years, I’ve rewritten Rust’s regex crate to enable better internal composition, and to make it easier to add optimizations while maintaining correctness. In the course of this rewrite I created a new crate, regex-automata, which exposes much of the regex crate internals as their own APIs for others to use. To my knowledge, this is the first regex library to expose its internals to the degree done in regex-automata as a separately versioned library.

This blog post discusses the problems that led to the rewrite and how the rewrite solved them, and then gives a guided tour of regex-automata’s API.

Target audience: Rust programmers and anyone with an interest in how one particular finite automata regex engine is implemented. Prior experience with regular expressions is assumed.

Table of Contents

Brief history

The problems

Problem: composition was difficult

Problem: testing was difficult

Problem: requests for niche APIs

Problem: fully compiled DFAs

Follow along with regex-cli

Flow of data

Literal optimizations

Motivating literal optimizations

Literal extraction

Searching for literals

The NFA data type

A simple NFA example

NFA optimization: sparse states

NFA optimization: minimal UTF-8 automata

NFA optimization: literal trie

NFA future work

Regex engines

Common elements among regex engines

Engine: Pike VM

Engine: bounded backtracker

Engine: one-pass DFA

Engine: DFA

Engine: hybrid NFA/DFA

The meta regex engine

Differences with RE2

Testing strategy

Benchmarking

Costs

Wrap up

Brief history

In September 2012, an issue was filed on the Rust repository requesting that a regex library be added to the Rust Distribution. Graydon Hoare later commented in that thread that they preferred RE2. For those who don’t know, RE2 is a regex engine that uses finite automata to guarantee O(m * n) worst case search time (where m is proportional to the size of the regex and n is proportional to the size of the text being searched) while providing a Perl-like syntax that excludes features with no known efficient implementation. RE2’s design is described by its author, Russ Cox, in a series of articles on implementing a regex engine using finite automata.

In April 2014, I showed up and said I was working on a regex engine inspired by RE2. I treated Cox’s articles as a blueprint for how to build a regex library. Soon thereafter, I published an RFC to add a regex library to the “Rust Distribution.” This was before Rust 1.0 and Cargo (the second version, not the first), and the “Rust Distribution” referred to rustc, std and several “supporting” libraries that were all bundled together. This RFC proposed adding a regex crate to that list of supporting libraries.

Ten days later, the RFC was approved. The next day, I submitted a pull request to rust-lang/rust, adding it to the Rust distribution. Things moved fast back then. Notice also that I had originally called the crate regexp. The PR to Rust involved a discussion about naming that eventually resulted in it being called regex instead.

Two years later in May 2016, I wrote an RFC to release regex 1.0. That took a few months to be approved, but it wasn’t until a couple years later in May 2018 that I actually released regex 1.0.

Before regex 1.0 was released, I had been steadily working on a complete re-imagining of the crate internals. From a commit message in March 2018:

The [regex-syntax] rewrite is intended to be the first phase in an effort to overhaul the entire regex crate.

I didn’t know exactly where I was going at that point in time, but in March 2020, I started work in earnest on rewriting the actual matching engines. A little more than three years later, regex 1.9 has been released with the completed rewrite.

The problems

What kinds of problems were facing the regex crate that warranted a full rewrite? And moreover, why publish the rewritten internals as a separate crate?

There are a host of things to discuss here.

Problem: composition was difficult

Following in the tradition of RE2, the regex crate contains a number of different strategies that it can use to implement a search. Sometimes multiple strategies are used in a single search call.

There are generally two dimensions, often at odds with one another, to each strategy: performance and functionality. Faster strategies tend to be more limited in functionality. For example, a fast strategy might be able to report the start and end of a match but not the offsets for each capture group in the regex. Conversely, a slower strategy might be needed to report the offsets of each capture group.
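To make the trade-off concrete, here is a minimal sketch of choosing the cheapest strategy that can still answer what the caller asked for. Everything in it is hypothetical: the `Need` and `Strategy` types, the strategy names and the cost numbers are invented for illustration and are not the regex crate's actual types or selection logic.

```rust
/// What a caller wants from a search, ordered from least to most demanding.
/// The derived ordering follows declaration order.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Need {
    /// Only whether a match exists.
    IsMatch,
    /// Start and end offsets of the overall match.
    MatchOffsets,
    /// Offsets of every capture group.
    CaptureOffsets,
}

/// A hypothetical search strategy (not a real regex crate type).
#[derive(Debug)]
struct Strategy {
    name: &'static str,
    /// Lower is faster. These are made-up numbers, not measurements.
    relative_cost: u32,
    /// The most demanding request this strategy can answer.
    max_need: Need,
}

/// Two illustrative strategies: one fast but limited, one slow but complete.
fn example_strategies() -> Vec<Strategy> {
    vec![
        Strategy {
            name: "fast-but-limited",
            relative_cost: 1,
            max_need: Need::MatchOffsets,
        },
        Strategy {
            name: "slow-but-complete",
            relative_cost: 10,
            max_need: Need::CaptureOffsets,
        },
    ]
}

/// Pick the cheapest strategy that supports the requested functionality.
fn choose(strategies: &[Strategy], need: Need) -> Option<&Strategy> {
    strategies
        .iter()
        .filter(|s| s.max_need >= need)
        .min_by_key(|s| s.relative_cost)
}

fn main() {
    let strategies = example_strategies();
    for need in [Need::IsMatch, Need::MatchOffsets, Need::CaptureOffsets] {
        let chosen = choose(&strategies, need).expect("some strategy applies");
        println!("{:?} -> {}", need, chosen.name);
    }
}
```

In this sketch, a request for capture group offsets rules out the fast strategy entirely, so the slower one is chosen even though it costs more.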

When I originally wrote the regex crate, I implemented a single strategy (the PikeVM) and didn’t do any thoughtful design work for how to incorporate alternative strategies. Eventually, new strategies were added organically:

A BoundedBacktracker that can report capture group offsets like...
