A beginner-friendly approach to Pratt Parsing

Parsing Math with Pratt Parsing

WASHINGTON RAMOS' WEBSITE


A few weeks ago I decided to start a challenge to build a Lox interpreter in order to learn C++. In the process, I learned about a technique for parsing called Pratt Parsing, one that is not very mainstream and thus not very well taught or documented. In this post, I'll try to teach you the basics of parsing using Pratt and by the end of it we should be able to parse and interpret relatively complicated math expressions.

I'll assume you know a thing or two about C++, but the code here is simple enough that someone coming from Java or C# could also read it. If you come from C++: some correctness and ceremony were sacrificed for the sake of education.

This post is also fairly beginner-friendly, so if you already understand what lexers and ASTs are, you might feel like it's a little too slow. If not, have fun reading.

Understanding the pipeline

First of all, let's have an overview of what we'll be doing. Parsing is the process of transforming raw text into a representational state that can either be interpreted or compiled. In our case, we'll transform (parse) raw text into an Abstract Syntax Tree that can be interpreted by a piece of code. The pipeline looks like this:

raw text -> lexer -> parser -> Abstract Syntax Tree -> interpreter -> output

Raw text will be our math expressions, like "-5 * 10 + (25 / 5)". Then, we'll have a Lexer; a Lexer is a piece of code that transforms raw text into chunks of data, often referred to as Tokens. In our case, the tokens would be:

- (unary minus)
5 (number 5)
* (multiplication)
10 (number 10)
+ (plus)
( (opening parenthesis)
25 (number 25)
/ (division)
5 (number 5)
) (closing parenthesis)

Then, we'll feed all those tokens into a parser that will transform them into an Abstract Syntax Tree, which we'll refer to as "AST" from now on for brevity. Our AST would look like this:

          (+)
         /   \
        /     \
     (*)       (/)
    /   \     /   \
  (-)    10  25    5
   |
   5
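To make the tree a bit more concrete, here is a minimal sketch of how an AST like this could be represented and walked in C++, with the unary minus holding the 5 as its operand. The names `Node`, `Num`, `Un`, `Bin`, `Evaluate` and `BuildExample` are placeholders I made up for illustration; they are not necessarily the types used later in this post.

```cpp
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical node type for illustration: one struct covers numbers,
// unary operators and binary operators.
struct Node {
    enum class Kind { Number, Unary, Binary };
    Kind kind = Kind::Number;
    char op = 0;                  // '+', '-', '*', '/' for operator nodes
    double value = 0.0;           // only meaningful when kind == Number
    std::unique_ptr<Node> left;   // operand of a unary node, or left operand
    std::unique_ptr<Node> right;  // right operand of a binary node
};

std::unique_ptr<Node> Num(double v) {
    auto n = std::make_unique<Node>();
    n->kind = Node::Kind::Number;
    n->value = v;
    return n;
}

std::unique_ptr<Node> Un(char op, std::unique_ptr<Node> operand) {
    auto n = std::make_unique<Node>();
    n->kind = Node::Kind::Unary;
    n->op = op;
    n->left = std::move(operand);
    return n;
}

std::unique_ptr<Node> Bin(char op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->kind = Node::Kind::Binary;
    n->op = op;
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// Walks the tree bottom-up: children are evaluated before their parent.
double Evaluate(const Node &n) {
    switch (n.kind) {
        case Node::Kind::Number:
            return n.value;
        case Node::Kind::Unary:
            if (n.op == '-') return -Evaluate(*n.left);
            throw std::logic_error("unknown unary operator");
        case Node::Kind::Binary: {
            const double l = Evaluate(*n.left);
            const double r = Evaluate(*n.right);
            switch (n.op) {
                case '+': return l + r;
                case '-': return l - r;
                case '*': return l * r;
                case '/': return l / r;
                default: throw std::logic_error("unknown binary operator");
            }
        }
    }
    throw std::logic_error("unknown node kind");
}

// Builds the tree for "-5 * 10 + (25 / 5)" by hand, mirroring the drawing.
std::unique_ptr<Node> BuildExample() {
    return Bin('+',
               Bin('*', Un('-', Num(5)), Num(10)),
               Bin('/', Num(25), Num(5)));
}
```

Calling `Evaluate(*BuildExample())` gives -45: the left subtree is -50, the right subtree is 5. The parser's whole job is to produce this shape from the tokens so we don't have to build it by hand.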

By that point, we'll (hopefully) have already learned Pratt's technique, but we'll keep going and have the AST interpreted just so we can view the output of our parser.

The Lexer

The Lexer is almost the simplest part of our project, second only to the main loop of Pratt's algorithm. Some people do not enjoy writing lexers by hand, but we'll write one anyway for the sake of education. Before we get into the lexer, let's first define Token, as it is the output of a lexer:

#include <string>

enum class TokenType {
    NUMBER,
    PLUS,
    MINUS,
    DIVISION,
    MULTIPLICATION,
    LEFT_PARENTHESIS,
    RIGHT_PARENTHESIS,
    END_OF_FILE
};

struct Token {
    TokenType type;
    std::string value;
};

The more complex your project becomes, the more complex these two become. For example, Token could also have fields recording where it was seen, such as line and column. There could be more token types too: if we were parsing a programming language, we would probably have types like "IF" and "WHILE" to represent those keywords.
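As a rough illustration of that growth, here is a sketch of what a richer token could look like. `RichTokenType`, `RichToken` and `Describe` are hypothetical names, chosen so they don't clash with the `TokenType` and `Token` used in the rest of this post, and the error message format is just an assumption.

```cpp
#include <cstdint>
#include <string>

// Hypothetical extension of the token types: a couple of keywords added.
enum class RichTokenType { NUMBER, PLUS, MINUS, IF, WHILE, END_OF_FILE };

// Hypothetical token that also remembers where it was seen.
struct RichToken {
    RichTokenType type;
    std::string value;
    std::uint32_t line = 1;    // 1-based line of the token's first character
    std::uint32_t column = 1;  // 1-based column of the token's first character
};

// Formats the position for an error message, e.g. "3:14: unexpected 'while'".
std::string Describe(const RichToken &t) {
    return std::to_string(t.line) + ":" + std::to_string(t.column) +
           ": unexpected '" + t.value + "'";
}
```

We won't need any of this for math expressions, so the two-field Token stays as it is.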

Next, the lexer's interface:

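#include <cstdint>
#include <string_view>

struct Lexer {
    // Taken by value: a std::string_view is already a cheap, non-owning view,
    // so the lexer does not need to hold a reference to the caller's object.
    explicit Lexer(std::string_view string_view) : string_view(string_view) {}

    Token NextToken();

private:
    uint32_t pos = 0;
    std::string_view string_view;

    char Peek() const;
};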

Notice how thin it is! Indeed, a lexer doesn't provide much more than "the next token" in the raw text it's provided with, but that is all we'll need. Now, to its implementation:

#include <cctype>
#include <stdexcept>
#include <string>

// Returns the current character, or '\0' once we are past the end.
char Lexer::Peek() const {
    if (pos >= string_view.size())
        return '\0';

    return string_view[pos];
}

Token Lexer::NextToken() {
    // Skip whitespace.
    while (pos < string_view.size() && std::isspace(static_cast<unsigned char>(Peek()))) {
        ++pos;
    }

    if (pos >= string_view.size())
        return Token{TokenType::END_OF_FILE, ""};

    const char c = Peek();

    // Numbers: collect digits and dots into a string.
    if (std::isdigit(static_cast<unsigned char>(c))) {
        std::string val;
        while (std::isdigit(static_cast<unsigned char>(Peek())) || Peek() == '.') {
            val += Peek();
            ++pos;
        }
        return Token{TokenType::NUMBER, val};
    }

    // Everything else is a single-character token.
    ++pos;
    switch (c) {
        case '+': return Token{TokenType::PLUS, "+"};
        case '-': return Token{TokenType::MINUS, "-"};
        case '*': return Token{TokenType::MULTIPLICATION, "*"};
        case '/': return Token{TokenType::DIVISION, "/"};
        case '(': return Token{TokenType::LEFT_PARENTHESIS, "("};
        case ')': return Token{TokenType::RIGHT_PARENTHESIS, ")"};
        default: throw std::runtime_error("Unexpected character");
    }
}

The lexer looks at the current character and, if it's whitespace, skips ahead until it finds a non-whitespace character. It then returns a token representing that character. If the character is a digit, it instead collects the digits (and any '.') that follow into a string and returns that as a NUMBER token.
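One thing worth noticing is that the digit loop accepts any number of dots, so input like "1.2.3" comes out as a single NUMBER token. If that bothers you, one possible tightening is sketched below. `ScanNumber` is a hypothetical free function, not part of the Lexer in this post; it takes the text and a position by reference instead of using the lexer's members.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: reads digits with at most one '.' starting at
// text[pos], and advances pos past the characters it consumed.
std::string ScanNumber(std::string_view text, std::size_t &pos) {
    std::string val;
    bool seen_dot = false;
    while (pos < text.size()) {
        const char c = text[pos];
        if (c == '.') {
            if (seen_dot) break;  // a second dot ends the number
            seen_dot = true;
        } else if (!std::isdigit(static_cast<unsigned char>(c))) {
            break;                // anything else ends the number too
        }
        val += c;
        ++pos;
    }
    return val;
}
```

With this, "1.2.3" yields "1.2" and leaves the position on the second dot, so the next call to the lexer would hit the "Unexpected character" case instead of silently accepting a malformed number.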

Notice how it doesn't try to create numbers or interpret them. It won't turn "-5" into a double with the value -5; instead it returns two separate tokens, "-" and "5". That is because the lexer is supposed to just return tokens, not interpret them; that is the parser's responsibility. The lexer is as dumb as it gets.
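If you want to see that for yourself, here is a small self-contained sketch that runs the lexer over our example expression. It repeats a compact copy of the lexer so it compiles on its own (written inline and storing the std::string_view by value, which is my assumption for the sketch), and `Tokenize` is a hypothetical helper that is not part of the code in this post.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

enum class TokenType { NUMBER, PLUS, MINUS, DIVISION, MULTIPLICATION,
                       LEFT_PARENTHESIS, RIGHT_PARENTHESIS, END_OF_FILE };

struct Token { TokenType type; std::string value; };

// Compact copy of the lexer from this section, with the same behavior.
struct Lexer {
    explicit Lexer(std::string_view text) : text(text) {}

    Token NextToken() {
        // Peek() returns '\0' at the end, which is not whitespace.
        while (std::isspace(static_cast<unsigned char>(Peek()))) ++pos;
        if (pos >= text.size()) return Token{TokenType::END_OF_FILE, ""};

        const char c = Peek();
        if (std::isdigit(static_cast<unsigned char>(c))) {
            std::string val;
            while (std::isdigit(static_cast<unsigned char>(Peek())) || Peek() == '.') {
                val += Peek();
                ++pos;
            }
            return Token{TokenType::NUMBER, val};
        }

        ++pos;
        switch (c) {
            case '+': return Token{TokenType::PLUS, "+"};
            case '-': return Token{TokenType::MINUS, "-"};
            case '*': return Token{TokenType::MULTIPLICATION, "*"};
            case '/': return Token{TokenType::DIVISION, "/"};
            case '(': return Token{TokenType::LEFT_PARENTHESIS, "("};
            case ')': return Token{TokenType::RIGHT_PARENTHESIS, ")"};
            default: throw std::runtime_error("Unexpected character");
        }
    }

private:
    std::size_t pos = 0;
    std::string_view text;
    char Peek() const { return pos < text.size() ? text[pos] : '\0'; }
};

// Hypothetical helper: drains the lexer into a vector, leaving out END_OF_FILE.
std::vector<Token> Tokenize(std::string_view text) {
    Lexer lexer(text);
    std::vector<Token> tokens;
    for (Token t = lexer.NextToken(); t.type != TokenType::END_OF_FILE;
         t = lexer.NextToken()) {
        tokens.push_back(t);
    }
    return tokens;
}
```

Calling `Tokenize("-5 * 10 + (25 / 5)")` gives ten tokens. The first is a MINUS with value "-" and the second is a NUMBER with value "5"; nothing in the lexer knows that the two belong together.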

Another important point is that the tokens a lexer returns or supports depend on what we're parsing. As I've said before, if we were...
