"Optimizing" Concurrent Regexes


"Optimizing" concurrent regexes - Ayende @ Rahien

Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database




I am looking into some regex work, and I ran into a performance problem. I need to run a particular regex over a large number (millions) of strings. That caused my spidey sense to… tingle.

The code in question looked something like this:

    long matches = CountMatchingEntries(new Regex(@"\s+user id\s+"));

Creating a regex for each invocation is… expensive. That is why we have the RegexOptions.Compiled flag, after all. And the Regex class is thread-safe, so I did the equivalent of this code:

    // class level
    static Regex s_regex = new Regex(@"\s+user id\s+", RegexOptions.Compiled);
    // inside a method
    long matches = CountMatchingEntries(s_regex);

The system immediately took a big, stinking performance regression all over my benchmarks. At first, I was sure that I wasn't doing something properly, so I rewrote this using the modern approach, with source generators, like this:

    public partial class MyClass
    {
        [GeneratedRegex(@"\s+user id\s+",
            RegexOptions.Compiled | RegexOptions.CultureInvariant |
            RegexOptions.IgnoreCase)]
        public static partial Regex UserIdRegex();
    }

That had the exact same behavior, a major performance regression.

We are talking about this making the code six times slower. That is a huge cost for something that every fiber in my being tells me should be faster. As I started digging into things, I managed to reproduce this in an isolated manner.

The following benchmark shows the core of the problem:

    using System;
    using System.Diagnostics;
    using System.Linq;
    using System.Text.RegularExpressions;

    var lines = Enumerable.Range(0, 100_000).Select(i => $"The user id #{i}").ToArray();

    var regex = new Regex(@"\s+user id\s+", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
    // This is fast
    Exec(() => lines.Count(l => regex.IsMatch(l)));
    // This is slow
    Exec(() => lines.AsParallel().Count(l => regex.IsMatch(l)));

    static void Exec(Func<int> run)
    {
        var before = GC.GetTotalAllocatedBytes();
        var sw = Stopwatch.StartNew();
        //var count = lines.AsParallel().Count(l => MyClass.UserIdRegex().IsMatch(l));
        var count = run();
        sw.Stop();
        var after = GC.GetTotalAllocatedBytes();
        Console.WriteLine($"Count: {count}, Time: {sw.ElapsedMilliseconds} ms, Memory: {after - before} bytes");
    }

I use the same Regex instance in both cases and run it 100,000 times. The first time, I do that using a single thread, and the second time I'm using multiple threads.

The only difference between those runs is the addition of AsParallel() to force it to use multiple threads. My code isn't actually using this or explicit threads, but it is using a single cached Regex instance in a web environment, so under load, it is being used concurrently.

What is going on here? The parallel code is much slower because it allocates. It turns out that deep in the guts of .NET's Regex engine we have this block of code:

    RegexRunner runner = Interlocked.Exchange(ref _runner, null) ?? CreateRunner();
    try
    {
        // do work
    }
    finally
    {
        _runner = runner;
    }

In other words, the actual RegexRunner needs to keep some mutable state, which doesn't play nicely with threads. In order to maintain itself under concurrent usage, the Regex "checks out" the runner instance when it is being used, so it can be the sole owner.

Other callers on the same Regex instance at the same time will find the slot empty and allocate a new runner instance. That is what is causing the massive slowdown.
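To make the check-out behavior concrete, here is a minimal sketch of the same pattern outside of Regex. It is not the .NET implementation: `CheckOut`, `CheckIn`, `cached` and `created` are hypothetical names, and a plain `int[]` stands in for the runner's mutable state. The overlap is simulated on a single thread so that the result is deterministic.

```csharp
#nullable enable
using System;
using System.Threading;

int created = 0;       // how many runner stand-ins were allocated (hypothetical counter)
int[]? cached = null;  // plays the role of the single cached runner slot

// Check out: take the cached instance (leaving null behind) or allocate a new one.
int[] CheckOut()
{
    int[]? taken = Interlocked.Exchange(ref cached, null);
    if (taken != null) return taken;
    created++;
    return new int[16];
}

// Check in: put the instance back for the next caller.
void CheckIn(int[] runner) => cached = runner;

// Sequential use: after the first call, the same instance is reused every time.
for (int i = 0; i < 1000; i++)
{
    var r = CheckOut();
    CheckIn(r);
}
int afterSequential = created;
Console.WriteLine($"Sequential: {afterSequential} allocation(s)");

// Overlapping use: the second caller arrives while the first still holds the instance.
var first = CheckOut();   // takes the cached instance
var second = CheckOut();  // slot is empty, so this one allocates
CheckIn(first);
CheckIn(second);
int afterOverlap = created;
Console.WriteLine($"After overlap: {afterOverlap} allocation(s)");
```

A thousand sequential calls cost one allocation in this sketch, while a single overlapping call costs another. With real threads hammering one slot, overlaps of this kind are the common case rather than the exception.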
If we are using the Regex instance in a single-threaded manner, the code will check the runner out and back in as needed, with zero allocations and quite fast.

If you are using it from concurrent code, you'll allocate like crazy and can expect your performance to drop by as much as five times.

I created an issue for that, since I believe this is quite a tripping hazard in terms of performance.

The current fix that I found was to use ThreadLocal<Regex> for this, ensuring that there is no actual concurrent usage of any single instance, at the cost of higher memory usage and repeated initializations.
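A minimal sketch of what a per-thread workaround could look like, reusing the pattern and test data from the benchmark above. The variable name `perThreadRegex` is an assumption for illustration, and this is not necessarily the exact code that ended up in production.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading;

// One Regex per thread, built lazily the first time each thread asks for it.
// This trades memory and repeated construction for never sharing an instance.
var perThreadRegex = new ThreadLocal<Regex>(() => new Regex(
    @"\s+user id\s+",
    RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase));

var lines = Enumerable.Range(0, 100_000).Select(i => $"The user id #{i}").ToArray();

// Each worker thread only ever touches its own Regex instance.
int count = lines.AsParallel().Count(l => perThreadRegex.Value.IsMatch(l));
Console.WriteLine($"Count: {count}");

perThreadRegex.Dispose();
```

Every thread that touches `perThreadRegex.Value` pays for constructing (and, with RegexOptions.Compiled, compiling) its own Regex once, which is where the repeated initialization cost comes from.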



Comments

I have been following your blog for decades now. Posts like this are the reason I keep coming back.

For the issue: wow, this is ugly and totally unexpected.
