ongoing by Tim Bray · Automata, Built For Comfort or Speed
Automata, Built For Comfort or Speed
It’s been a while. February was the last entry in this<br>Quamina Diary; I never stopped working on it but there hasn’t been<br>much blogworthy. This piece offers a progress update for those who’ve been entertained by Quamina, and also a<br>pleasing (well, to me anyhow) dip into finite-automata theory and practice. With numbers and graphs and a bluesman!
Since it’s been so long, here’s what Quamina, a Go-language library, does: You can add “Patterns” to a Quamina instance, they<br>match the values of fields in JSON objects, and then you show Quamina a JSON “Event” and it’ll tell you which Patterns matched<br>it. It’s pleasingly fast and in many cases the speed is not strongly affected by the number of Patterns you’re trying to<br>match.
The last release, V2.0.2 back in March, marked the arrival of a reasonably-full regular-expression dialect into Quamina’s<br>pattern language, so you can say, for example, that the Filename field has to match<br>map-[0-9]+.(png|pdf|jpe?g).<br>Regular expressions were a lot of work and a lot of fun.
Finite-automaton wrangling ·<br>Quamina works by compiling the Patterns into finite automata, then matching the Event fields by traversing them.<br>To handle multiple Patterns, Quamina merges the automata together, a technique that’s been well-studied in Computer Science for<br>decades. From a set-theoretic point of view it’s a union; an OR through Boolean lenses. There’s really no limit in theory<br>for how many patterns you can merge, and Quamina has no problem at all with thousands.
There are two flavors of finite automata: Deterministic (DFAs) and Nondeterministic (NFAs). If you have a university<br>degree in CS they probably made you study this at some point. Weirdly, many people find the subject boring.
I’m not going to explain the differences, if you want to know it’s easy to look up, and anyhow the story I want to tell<br>doesn’t need it. Here’s what the story needs:
To implement wildcards and regular expressions, you need NFAs.
Matching data with DFAs is generally a whole lot faster than with NFAs.
There’s a well-known algorithm for transforming any NFA into a DFA that matches exactly the same data. It’s really a<br>lovely algorithm, you want to tickle it under its adorable little chin.
In some cases, the algorithm can be brutally expensive, along the lines of O(2N).
The benchmark from Hell ·<br>I found a data file in the Wordle source code with 13K or so 5-letter words. I inserted a * at a random location<br>in each. To illustrate, here are the first five results: aah*ed, aalii*, *aargh,<br>aar*ti, and abaca*.
The benchmark is, you load these into a Quamina instance and measure how fast it is to load them, how much memory the<br>merged NFA consumes, and how many Patterns per second you can match. Then you repeat the exercise, converting each NFA into<br>a DFA.
I loaded 10K Patterns in NFA form, but the NFA-to-DFA conversion got so brutal my patience only allowed me to load and<br>measure 300 in DFA form. (Which kind of gives away this article’s punchline but stick with me, the details are fun).
Willie Dixon ·<br>He was a famous bluesman and I need help from him to explain my APIs.
Quamina’s latest PR adds APIs<br>to turn the NFA-to-DFA conversion on and off; it’s off by default. There are two MatcherBuildMode values,<br>BuiltForComfort and BuiltForSpeed. Which are out-takes from Willie’s 1959 song Built For<br>Comfort, and the lyrics go like this:
Some people built like this, some people built like that,
The way I’m built, don’t you call me fat
Cause I’m built for comfort, I ain’t built for speed
Willie was a big guy, way over six feet and not exactly slender. I know this because I interviewed him once, but that’s<br>another story. Lots of others have covered this tune, most of whom are, um, well-rounded.
Speaking of which… here is Quamina’s memory consumption in NFA mode.
The memory cost is pretty well linear in the number of patterns. For 10K patterns Quamina uses something over 25M which feels<br>kind of reasonable to me.
Here’s the story for DFAs.
I haven’t run regressions or anything and I don’t think that’s actually O(2N), but the second<br>derivative is solidly positive, which in simple English means “doesn’t scale”.
Now let’s look at the Pattern-addition time for NFAs and then DFAs.
Once again, the NFA cost increases more or less linearly (albeit interestingly jagged), while for DFAs, look at that<br>second derivative. This was what eventually made my patience give out and limit the DFA run to 300 Patterns.
Now let’s look at the payoff, the number of Event matches/second in NFA- and DFA-land.
The NFA performance degrades at something like O(N-1), but DFAs just don’t care. After a few lurches<br>it rolls along at between 500 and 600K/second. Not bad, and based on experience with huge DFAs that were built from<br>exact string matches, my belief is that the matching speed would eventually drift down, but slowly, not remotely a...