Pegular Expressions

tosh1 pts0 comments

Janet for MortalsChapter Four: Pegular Expressions<br>Janet does not have native, built-in regular expressions.You can use a third-party regular expression library if you really have to, I dunno, validate an email address or something. But most of the time, if you’re writing Janet, you’ll be writing PEGs instead.PEG stands for “parsing expression grammar,” which is a mouthful, so I’m going to stick with the acronym, even though I just wrote a whole chapter about macros without abbreviating AST once.As a first — extremely crude — approximation, you can think of PEGs as an alternative notation for writing regular expressions. That’s not actually correct — PEGs are quite a bit more powerful than regular expressions, for starters, and they behave differently in a few respects — but we have to start somewhere, and this will let us re-use a lot of our existing knowledge.Here, let’s look at a few random regular expressions, and see how we’d write them in PEG format.<br>regex: .*<br>peg: (any 1)<br>1 means “match one byte.” any means “zero or more.”

regex: (na)+<br>peg: (some "na")<br>Strings match literally. There are no special characters to escape. some means “one or more.”

regex: \w{1,3}<br>peg: (between 1 3 (choice :w "_"))<br>peg: (between 1 3 (+ :w "_"))<br>Janet’s :w does not include _, so we use + to say “word character or underscore.” (+ ...) is an alias for (choice ...). between is inclusive on both ends.

regex: [^a-z-]<br>peg: (not (choice "-" (range "az")))<br>peg: (! (+ "-" (range "az")))<br>(! ...) is an alias for (not ...). You can negate any PEG, not just character classes.

regex: [a-z][0-9]?<br>peg: (sequence (range "az") (opt (range "09")))<br>peg: (* (range "az") (? (range "09")))<br>* matches all of its arguments in order. (* ...) is an alias for (sequence ...). ? means “zero or one,” and (? ...) is an alias for (opt ...).

Those are pretty random examples, and this is nowhere near an exhaustive list, but it’s enough for you to start forming a general idea. Let’s notice a few things from this:PEGs are quite a bit more verbose than regular expressions.<br>I wouldn’t want to write PEGs for, like, searching in my text editor, but in code I think the verbosity is almost always a good thing: it makes them easier to read and easier to modify.And when you’re writing “real” PEGs, you will break up large patterns into smaller, named components, which will prevent any single pattern from becoming unwieldy.<br>PEGs use a lot of characters that usually mean something else.<br>Like (+ first second third). That’s not addition; it’s choice. How does that work?Well, I didn’t state this explicitly, but PEGs are actually written quoted. A PEG is not (some "na"); it’s actually ['some "na"]. There is no function called some; the symbol itself is meaningful to the functions that consume PEGs.It’s conventional to write PEGs as quasiquoted forms: ~(some "na"), so that you can easily interpolate other values into them. (We’ll get to that soon.)<br>PEGs are structured trees, rather than strings.<br>This makes it easy to compose PEGs out of smaller pieces, which we’ll start to do soon, or to write functions that manipulate PEGs in the same way that we are used to manipulating abstract syntax trees.PEGs aren’t Janet abstract syntax trees, but you can see that they have a lot in common: they represent a tree structure out of nested tuples, lots of quoted symbols, and some numbers or strings or other values mixed in as well. In fact there is a general term for this kind of value: both abstract syntax trees and PEGs are examples of “symbolic expressions.”<br>Alright, now let’s talk about some of the ways that these patterns differ from their regular expression equivalents.First off, PEGs are always anchored to the beginning of the input, so there’s no equivalent “start of input” pattern. So (any 1) is actually equivalent to the regular expression ^.*.Except, no, that’s not strictly true. Because PEGs do not backtrack. Which means that all repetition is implicitly “possessive,” to use the regular expression term. So (any 1) is actually actually equivalent to ^.*+, which is not a construct that JavaScript’s regular expression engine supports.<br>The distinction is irrelevant in this case, but it matters for something like [ab]*[bc] — that will match bbb, but [ab]*+[bc], or the equivalent PEG (* (any (+ "a" "b")) (+ "b" "c")), will not.<br>PEGs do backtrack when using the choice combinator, as well as a few others. But backtracking is always obvious and explicit, as opposed to regular expressions’ implicit backtracking everywhere. This makes it less likely that you’ll accidentally write a PEG that executes in exponential time.Alright. There’s one more thing we should talk about before we get to a concrete example: numbers.We’ve seen 1 already, as a way to match any byte. You can write any other integer — 2, say, or even 3 — to match exactly that number of bytes.But you can also write negative numbers. Negative numbers don’t advance the input at all, and they fail if you could advance that...

pegs regular expressions write expression actually

Related Articles