mvzr: Minimum Viable Zig Regex

mvzr: The Minimum Viable Zig Regex Library

Finding myself in need of a regular expressions library for a Zig project, and needing it to build regex at runtime, not just comptime, I ended up speedrunning a little library for just that purpose.

This is that library. It's a simple bytecode-based VM, inspired by LPEG. Under 2000 lines of load-bearing code, no dependencies other than std.

The provided Regex type allows 64 'operations' and 8 unique ASCII character sets. If you would like more, or fewer, you can call SizedRegex(num_ops, num_sets) to customize the type.
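As a rough sketch of that customization: the README states that Regex is SizedRegex(64, 8) and that Regex.compile exists, so the example below assumes any SizedRegex type exposes the same compile function returning an optional. The names BigRegex and compileBig are hypothetical, and the import assumes the package is available to your build under the name "mvzr".

```zig
const std = @import("std");
const mvzr = @import("mvzr"); // assumption: the dependency is imported as "mvzr"

// Twice the capacity of the provided Regex type: 128 operations, 16 sets.
const BigRegex = mvzr.SizedRegex(128, 16);

/// Hypothetical wrapper. Assumes SizedRegex types offer the same
/// `compile` as mvzr.Regex, returning null when the pattern is rejected.
pub fn compileBig(pattern: []const u8) ?BigRegex {
    return BigRegex.compile(pattern);
}

test "a larger regex type compiles a pattern" {
    try std.testing.expect(compileBig("(foo|bar)+baz") != null);
}
```

A smaller type, such as SizedRegex(16, 2), would follow the same shape if you want to save space on simple patterns.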

Installation

Drop the file into your project, or use the Zig build system:

zig fetch --save "https://github.com/mnemnion/mvzr/archive/refs/tags/v0.3.9.tar.gz"

I'll do my best to keep that URL fresh, but it pays to check the GitHub repository for the latest release version.

v0.3.9 only differs from v0.3.8 in metadata, marking it as Zig 0.16 compatible. It works fine with Zig 0.15.2, but has the .minimum_zig_version field in the Zon file set higher to cooperate with modern practices.
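Once zig fetch --save has recorded the dependency in build.zig.zon, it still needs to be wired into build.zig. The following is a minimal sketch, not taken from the mvzr repository. It assumes the dependency was saved under the name "mvzr", that the package exposes a module also named "mvzr", and that its build script accepts the standard target and optimize options. The executable name my_app and the path src/main.zig are placeholders.

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    // Assumption: `zig fetch --save` stored the dependency as "mvzr",
    // and its build.zig declares the standard target/optimize options.
    const mvzr_dep = b.dependency("mvzr", .{
        .target = target,
        .optimize = optimize,
    });

    const exe = b.addExecutable(.{
        .name = "my_app", // placeholder name
        .root_module = b.createModule(.{
            .root_source_file = b.path("src/main.zig"), // placeholder path
            .target = target,
            .optimize = optimize,
        }),
    });

    // Assumption: the package exports a module named "mvzr".
    // After this, src/main.zig can use `@import("mvzr")`.
    exe.root_module.addImport("mvzr", mvzr_dep.module("mvzr"));

    b.installArtifact(exe);
}
```

If the module name differs, the build will fail with an error naming the modules the package does provide.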

Features

Zero allocation, comptime and runtime compiling and matching

X operations per regex (64 in the provided Regex type)

Y character sets per regex (8 in the provided Regex type)

Greedy quantifiers: *, +, ?

Lazy quantifiers: *?, +?, ??

Possessive/eager quantifiers: *+, ++, ?+

Alternation: foo|bar|baz

Grouping: foo|(bar|baz)+|quux

Sets: [abc], [^abc], [a-z], [^a-z], [\w+-], [\x04-\x1b]

Built-in character groups (ASCII): \w, \W, \s, \S, \d, \D

Escape sequences: \t, \n, \r, \xXX hex format

Same set as Zig: if you need the weird C ones, use \x format

Begin and end: ^ and $

Word boundaries: \b, \B

{M}, {M,}, {M,N}, {,N}

Limitations and Quirks

Minimal multibyte / Unicode support

This has improved somewhat. A regex like λ? now matches an optional lambda, not just an optional final byte. Additionally, ranges of bytes greater than 0x7f are now supported; with some care, this can match certain sets: for instance (\xce[\x91-\xa9])+ will match a string of uppercase Greek letters, \xc2[\x80-\x9f] matches a C1 control code, and so on. But you'll still need to work at the byte level, and use \x format, to do these tasks.

No fancy modifiers (you want case-insensitive, great, lowercase your string)

. matches any one byte. [^\n\r] works fine if that's not what you want

Or split into lines first, divide and conquer

Note: $ matches at the end of the string, optionally before a final newline. ^ matches only at the beginning of the string, and $ never matches before a newline that is not final.

Backtracks (sorry. For this design to work without backtracking, we need async back)

Compiler does some best-effort validation but I haven't really pounded on it

No capture groups. Divide and conquer
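The "lowercase your string" advice from the list above can be followed with std alone. This is a small sketch that makes no mvzr calls. The helper name lowered is hypothetical, and the folding covers ASCII letters only, so bytes outside that range pass through unchanged.

```zig
const std = @import("std");

/// Hypothetical helper: ASCII-lowercases `haystack` into `buf` and returns
/// the filled portion. `buf` must be at least as long as `haystack`.
fn lowered(buf: []u8, haystack: []const u8) []const u8 {
    return std.ascii.lowerString(buf, haystack);
}

test "lowercase the haystack before matching" {
    var buf: [64]u8 = undefined;
    const hay = lowered(&buf, "Hello, WORLD");
    try std.testing.expectEqualStrings("hello, world", hay);
    // `hay` can now be matched against a pattern written in lowercase.
}
```

Since the result is a copy, any match positions refer to the lowered buffer, which has the same length and offsets as the original string.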

As long as you color within the lines, it should be fine.

This library is not intended for use where an attacker could conceivably control the regex pattern.

Much like managing your own memory, if you know your tools and are smart about it, you can get a lot done with mvzr.
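The byte-level approach to multibyte text mentioned earlier rests on how UTF-8 lays out code points. As a sketch using only std (no mvzr calls), the test below checks that every code point from Α (U+0391) to Ω (U+03A9) encodes as the lead byte 0xce followed by a byte in 0x91 through 0xa9, which is the range a pattern like (\xce[\x91-\xa9])+ relies on. The helper name isUpperGreekPair is hypothetical. The range also includes U+03A2, which is unassigned.

```zig
const std = @import("std");

/// Hypothetical helper: true if the two bytes fall within \xce[\x91-\xa9].
fn isUpperGreekPair(lead: u8, trail: u8) bool {
    return lead == 0xce and trail >= 0x91 and trail <= 0xa9;
}

test "uppercase Greek letters encode within \\xce[\\x91-\\xa9]" {
    var buf: [4]u8 = undefined;
    var cp: u21 = 0x0391; // Α
    while (cp <= 0x03a9) : (cp += 1) { // through Ω
        const len = try std.unicode.utf8Encode(cp, &buf);
        try std.testing.expectEqual(@as(u3, 2), len); // always two bytes
        try std.testing.expect(isUpperGreekPair(buf[0], buf[1]));
    }
}
```

The same kind of check is a cheap way to confirm a byte range before committing it to a pattern.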

Interface

mvzr.Regex is available at comptime or runtime, and matching returns an mvzr.Match, consisting of a .slice field containing the match, as well as the .start and .end locations in the haystack. This is a borrowed slice; to own it, call match.toOwnedMatch(allocator), and deallocate later with match.deinit(allocator), or just free the .slice.

Similarly, if you need to store a Regex or SizedRegex for later, call regex.toOwnedRegex(allocator), freeing later with allocator.destroy(heap_regex).

// aka SizedRegex(64, 8)
const regex: mvzr.Regex = mvzr.compile(patt_str).?;
// or mvzr.Regex.compile(patt_str)
const match: mvzr.Match =...
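To make the shape of the interface concrete, here is a hedged sketch of compiling a pattern at runtime and reading the fields of the result. The compile call and the .slice and .start fields come from the description above. The match method, taking the haystack and returning an optional mvzr.Match for the first match found, is an assumption about the API, as is the import name "mvzr". The pattern and haystack are made up for illustration.

```zig
const std = @import("std");
const mvzr = @import("mvzr"); // assumption: the dependency is imported as "mvzr"

test "compile at runtime and inspect the match" {
    // compile returns an optional, so handle a rejected pattern explicitly.
    const regex = mvzr.compile("[a-z]+[0-9]+") orelse return error.BadPattern;

    // Assumption: `match` searches the haystack and returns ?mvzr.Match.
    const haystack = "id: abc123;";
    const m = regex.match(haystack) orelse return error.NoMatch;

    // .slice borrows from `haystack`, so it is only valid while that is.
    try std.testing.expectEqualStrings("abc123", m.slice);
    try std.testing.expectEqual(@as(usize, 4), m.start);
}
```

Using orelse instead of .? keeps a bad pattern from becoming a panic, which matters when the pattern is built at runtime.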
