Simple but Powerful Pratt Parsing
Welcome to my article about Pratt parsing — the monad tutorial of syntactic analysis. The number of Pratt parsing articles is so large that there exists a survey post :)
The goals of this particular article are:
- Arguing that the so-called left-recursion problem is overstated.
- Complaining about the inadequacy of BNF for representing infix expressions.
- Providing a description and implementation of the Pratt parsing algorithm which sticks to the core and doesn’t introduce a DSL-y abstraction.
- Understanding the algorithm myself for hopefully the last time. I’ve implemented a production-grade Pratt parser once, but I no longer immediately understand that code :-)
This post assumes a fair bit of familiarity with parsing techniques, and, for example, does not explain what a context free grammar is.
Introduction
Parsing is the process by which a compiler turns a sequence of tokens into a tree representation:
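For example, the expression `1 + 2 * 3` becomes a tree shaped roughly like this:

```
                        +
 "1 + 2 * 3"  -->      / \
                      1   *
                         / \
                        2   3
```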
There are many approaches to this task, which roughly fall into one of two broad categories:
- Using a DSL to specify an abstract grammar of the language
- Hand-writing the parser
Pratt parsing is one of the most frequently used techniques for hand-written parsing.
BNF
The pinnacle of syntactic analysis theory is discovering the context free grammar notation (often using BNF concrete syntax) for decoding linear structures into trees:
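For example, an item grammar fragment for a Rust-like language might look roughly like this (an illustrative sketch):

```
Item ::=
    StructItem
  | EnumItem
  | FnItem

StructItem ::=
    'struct' Name '{' FieldList '}'
```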
I remember being fascinated by this idea, especially by the parallels with natural language sentence structure. However, my optimism quickly waned once we got to describing expressions. The natural expression grammar indeed lets one see what an expression is.
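Something along these lines, with `number` standing in for an atom:

```
Expr ::=
    Expr '+' Expr
  | Expr '*' Expr
  | '(' Expr ')'
  | 'number'
```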
Although this grammar looks great, it is in fact ambiguous and imprecise, and needs to be rewritten to be amenable to automated parser generation. Specifically, we need to specify the precedence and associativity of operators. The fixed grammar looks like this:
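One standard way to do that is to stratify the grammar by precedence level, roughly like this:

```
Expr ::=
    Factor
  | Expr '+' Factor

Factor ::=
    Atom
  | Factor '*' Atom

Atom ::=
    'number'
  | '(' Expr ')'
```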
To me, the “shape” of expressions feels completely lost in this new formulation. Moreover, it took me three or four courses in formal languages before I was able to reliably create this grammar myself.
And that’s why I love Pratt parsing — it is an enhancement of recursive descent parsing algorithm, which uses the natural terminology of precedence and associativity for parsing expressions, instead of grammar obfuscation techniques.
Recursive descent and left-recursion
The simplest technique for hand-writing a parser is recursive descent, which models the grammar as a set of mutually recursive functions. For example, the above item grammar fragment can look like this:
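A sketch, assuming a hypothetical `Parser` type with `peek`/`expect` helpers and the usual token kinds (all the names here are illustrative):

```rust
fn item(p: &mut Parser) {
    match p.peek() {
        Token::StructKeyword => struct_item(p),
        Token::EnumKeyword => enum_item(p),
        Token::FnKeyword => fn_item(p),
        t => panic!("expected an item, got {:?}", t),
    }
}

fn struct_item(p: &mut Parser) {
    p.expect(Token::StructKeyword); // 'struct'
    name(p);                        //  Name
    p.expect(Token::LCurly);        // '{'
    field_list(p);                  //  FieldList
    p.expect(Token::RCurly);        // '}'
}
```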
Traditionally, textbooks point out left-recursive grammars as the Achilles heel of this approach, and use this drawback to motivate more advanced LR parsing techniques. An example of a problematic grammar can look like this:
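For example:

```
Sum ::=
    Sum '+' Int
  | Int
```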
Indeed, if we naively code the `sum` function, it wouldn’t be too useful:
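Using the same hypothetical `Parser` API, the naive version would be something like:

```rust
fn sum(p: &mut Parser) {
    sum(p); // (1)
    p.expect(Token::Plus);
    int(p);
}
```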
- (1) At this point we immediately loop and overflow the stack.
A theoretical fix to the problem involves rewriting the grammar to eliminate the left recursion. However, in practice, for a hand-written parser, the solution is much simpler — breaking away from the purely recursive paradigm and using a loop:
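For example, with an illustrative `eat` helper that consumes the next token if it matches:

```rust
fn sum(p: &mut Parser) {
    int(p);
    while p.eat(Token::Plus) {
        int(p);
    }
}
```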
Pratt parsing, the general shape
Using just loops won’t be enough for parsing infix expressions. Instead, Pratt parsing uses both loops and recursion:
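The overall shape is roughly this (pseudocode):

```
fn parse_expr() {
    ...
    loop {
        ...
        parse_expr()
        ...
    }
}
```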
Not only does it send your mind into a Möbius-shaped hamster wheel, it also handles associativity and precedence!
From Precedence to Binding Power
I have a confession to make: I am always confused by “high precedence” and “low precedence”. In `a + b * c`, addition has a lower precedence, but it is at the top of the parse tree…

So instead, I find thinking in terms of binding power more intuitive: in `A + B * C`, think of `+` as having binding power 3 and `*` as having binding power 5.
The `*` is stronger, it has more power to hold together `B` and `C`, and so the expression is parsed as `A + (B * C)`.
What about associativity though? In `A + B + C` all operators seem to have the same power, and it is unclear which `+` to fold first.
But this can also be modelled with power, if we make it slightly asymmetric: give each `+` a left power of 3 and a right power of 3.1. Here, we pumped the right power of `+` just a little bit, so that it holds the right operand tighter. We also put zeros at both ends of the expression, as there are no operators to bind from the sides.
Here, the first (and only the first) `+` holds both of its arguments tighter than its neighbors, so we can reduce it first, turning `A + B` into a single node. Now we can fold the second plus and get `(A + B) + C`.
Or, in terms of the syntax tree, the second `+` really likes its right operand more than the left one, so it rushes to get hold of `C`. While it does that, the first `+` captures both `A` and `B`, as they are uncontested.
What Pratt parsing does is find these badass, stronger-than-their-neighbors operators by processing the string from left to right.
We are almost at a point where we finally start writing some code, but let’s first look at the other running example.
We will use the function composition operator, `.` (dot), as a right-associative operator with a high binding power. That is, `f . g . h` is parsed as `f . (g . h)`, or, in terms of power: this time it is the left power of `.` that gets pumped a little (say 8.5 on the left versus 8 on the right), so that each `.` holds its left operand tighter than the preceding `.` does.
Minimal Pratt Parser
We will be parsing expressions where the basic atoms are single-character numbers and variables, and punctuation characters are used for operators. Let’s define a simple tokenizer:
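Something along these lines will do (a sketch):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
    Atom(char),
    Op(char),
    Eof,
}

struct Lexer {
    tokens: Vec<Token>,
}

impl Lexer {
    fn new(input: &str) -> Lexer {
        let mut tokens = input
            .chars()
            .filter(|it| !it.is_ascii_whitespace())
            .map(|c| match c {
                '0'..='9' | 'a'..='z' | 'A'..='Z' => Token::Atom(c),
                _ => Token::Op(c),
            })
            .collect::<Vec<_>>();
        tokens.reverse();
        Lexer { tokens }
    }

    fn next(&mut self) -> Token {
        self.tokens.pop().unwrap_or(Token::Eof)
    }
    fn peek(&mut self) -> Token {
        self.tokens.last().copied().unwrap_or(Token::Eof)
    }
}
```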
To make sure that we got the binding powers correctly, we will be transforming infix expressions into a gold-standard (not so popular in Poland, for whatever reason) unambiguous notation — S-expressions: `1 + 2 * 3 == (+ 1 (* 2 3))`.
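A small S-expression type with a `Display` implementation for the tests could look like this (a sketch):

```rust
use std::fmt;

enum S {
    Atom(char),
    Cons(char, Vec<S>),
}

impl fmt::Display for S {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            S::Atom(i) => write!(f, "{}", i),
            S::Cons(head, rest) => {
                write!(f, "({}", head)?;
                for s in rest {
                    write!(f, " {}", s)?
                }
                write!(f, ")")
            }
        }
    }
}
```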
And let’s start with just this: expressions with atoms and two infix binary operators, `+` and `*`:
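The entry point and a not-yet-implemented worker function might look like this:

```rust
fn expr(input: &str) -> S {
    let mut lexer = Lexer::new(input);
    expr_bp(&mut lexer)
}

fn expr_bp(lexer: &mut Lexer) -> S {
    todo!()
}
```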
So, the general approach is roughly the one we used to deal with left recursion — start with parsing a first number, and then loop, consuming operators and doing … something?
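A first cut (a sketch), which can handle a lone atom but bails on the interesting part:

```rust
fn expr_bp(lexer: &mut Lexer) -> S {
    let lhs = match lexer.next() {
        Token::Atom(it) => S::Atom(it),
        t => panic!("bad token: {:?}", t),
    };

    loop {
        let op = match lexer.next() {
            Token::Eof => break,
            Token::Op(op) => op,
            t => panic!("bad token: {:?}", t),
        };

        todo!()
    }

    lhs
}

#[test]
fn tests() {
    let s = expr("1"); // (1)
    assert_eq!(s.to_string(), "1");
}
```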
- (1) Note that we can already parse this simple test!
We want to use this power idea, so let’s compute both the left and right powers of the operator. We’ll use `u8` to represent power, so, for associativity, we’ll add 1. And we’ll reserve the power of 0 for the end of input, so the lowest power an operator can have is 1.
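With just `+`, `-`, `*` and `/`, the table can be as simple as this (a sketch):

```rust
fn infix_binding_power(op: char) -> (u8, u8) {
    match op {
        '+' | '-' => (1, 2),
        '*' | '/' => (3, 4),
        _ => panic!("bad op: {:?}", op),
    }
}
```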
And now comes the tricky bit, where we introduce recursion into the picture. Let’s think about this example, with the powers written next to each operator: `a + b * c * d + e`, where `+` has powers `(1, 2)` and `*` has powers `(3, 4)`.
The cursor is at the first `+`, we know that the left bp is 1 and the right one is 2. The `lhs` stores `a`. The next operator after `+` is `*`, so we shouldn’t add `b` to `a`. The problem is that we haven’t yet seen the next operator, we are just past `+`. Can we add a lookahead? Looks like no — we’d have to look past all of `b`, `c` and `d` to find the next operator with lower binding power, which sounds pretty unbounded.

But we are onto something! Our current right priority is 2, and, to be able to fold the expression, we need to find the next operator with lower priority. So let’s recursively call `expr_bp` starting at `b`, but also tell it to stop as soon as bp drops below 2. This necessitates the addition of the `min_bp` argument to the main function.
And lo, we have a fully functioning minimal Pratt parser:
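A sketch of the whole thing, with the interesting spots marked for the notes below:

```rust
fn expr(input: &str) -> S {
    let mut lexer = Lexer::new(input);
    expr_bp(&mut lexer, 0) // (5)
}

fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S { // (1)
    let mut lhs = match lexer.next() {
        Token::Atom(it) => S::Atom(it),
        t => panic!("bad token: {:?}", t),
    };

    loop {
        let op = match lexer.peek() {
            Token::Eof => break,
            Token::Op(op) => op,
            t => panic!("bad token: {:?}", t),
        };

        let (l_bp, r_bp) = infix_binding_power(op);
        if l_bp < min_bp { // (2)
            break;
        }

        lexer.next(); // (3)
        let rhs = expr_bp(lexer, r_bp);

        lhs = S::Cons(op, vec![lhs, rhs]); // (4)
    }

    lhs
}

#[test]
fn tests() {
    let s = expr("1");
    assert_eq!(s.to_string(), "1");

    let s = expr("1 + 2 * 3");
    assert_eq!(s.to_string(), "(+ 1 (* 2 3))");

    let s = expr("a + b * c * d + e");
    assert_eq!(s.to_string(), "(+ (+ a (* (* b c) d)) e)");
}
```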
- (1) The `min_bp` argument is the crucial addition. `expr_bp` now parses expressions with relatively high binding power. As soon as it sees something weaker than `min_bp`, it stops.
- (2) This is the “it stops” point.
- (3) And here we bump past the operator itself and make the recursive call. Note how we use `l_bp` to check against `min_bp`, and `r_bp` as the new `min_bp` of the recursive call. So, you can think about `min_bp` as the binding power of the operator to the left of the current expression.
- (4) Finally, after parsing the correct right hand side, we assemble the new current expression.
- (5) To start the recursion, we use a binding power of zero. Remember, at the beginning the binding power of the operator to the left is the lowest possible, zero, as there’s no actual operator there.
So, yup, these 40 lines are the Pratt parsing algorithm. They are tricky, but, if you understand them, everything else is straightforward additions.
Bells and Whistles
Now let’s add all kinds of weird expressions to show the power and flexibility of the algorithm.
First, let’s add a high-priority, right-associative function composition operator, `.`:
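In the binding power table, roughly:

```rust
fn infix_binding_power(op: char) -> (u8, u8) {
    match op {
        '+' | '-' => (1, 2),
        '*' | '/' => (3, 4),
        '.' => (6, 5), // the new line: the left bp is higher than the right one
        _ => panic!("bad op: {:?}", op),
    }
}
```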
Yup, it’s a single line! Note how the left side of the operator binds tighter, which gives us the desired right associativity:
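For example, with the table above:

```rust
#[test]
fn tests() {
    let s = expr("f . g . h");
    assert_eq!(s.to_string(), "(. f (. g h))");

    let s = expr("1 + 2 + f . g . h * 3 * 4");
    assert_eq!(s.to_string(), "(+ (+ 1 2) (* (* (. f (. g h)) 3) 4))");
}
```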
Now, let’s add unary `-`, which binds tighter than binary arithmetic operators, but less tightly than composition. This requires changes to how we start our loop, as we can no longer assume that the first token is an atom, and need to handle minus as well. But let the types drive us. First, we start with binding powers. As this is a unary operator, it really only has a right binding power, so, ahem, let’s just code this:
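A sketch, with the two notes below marked:

```rust
fn prefix_binding_power(op: char) -> ((), u8) { // (1)
    match op {
        '+' | '-' => ((), 5),
        _ => panic!("bad op: {:?}", op),
    }
}

fn infix_binding_power(op: char) -> (u8, u8) {
    match op {
        '+' | '-' => (1, 2),
        '*' | '/' => (3, 4),
        '.' => (8, 7), // (2)
        _ => panic!("bad op: {:?}", op),
    }
}
```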
- (1) Here, we return a dummy `()` to make it clear that this is a prefix, and not a postfix, operator, and thus can only bind things to the right.
- (2) Note that, as we want to add unary `-` between `.` and `*`, we need to shift the priorities of `.` by two. The general rule is that we use an odd priority as a base, and bump it by one for associativity, if the operator is binary. For unary minus it doesn’t matter and we could have used either 5 or 6, but sticking to odd is more consistent.
Plugging this into `expr_bp`, we get a new arm at the start for the case where the first token is a prefix operator rather than an atom. Now, we only have `r_bp` and not `l_bp`, so let’s just copy-paste half of the code from the main loop? Remember, we use `r_bp` for recursive calls:
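Concretely, something like this (a sketch), together with tests:

```rust
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
    let mut lhs = match lexer.next() {
        Token::Atom(it) => S::Atom(it),
        Token::Op(op) => {
            // A prefix operator has no left operand; only its right
            // binding power matters for the recursive call.
            let ((), r_bp) = prefix_binding_power(op);
            let rhs = expr_bp(lexer, r_bp);
            S::Cons(op, vec![rhs])
        }
        t => panic!("bad token: {:?}", t),
    };

    loop {
        let op = match lexer.peek() {
            Token::Eof => break,
            Token::Op(op) => op,
            t => panic!("bad token: {:?}", t),
        };

        let (l_bp, r_bp) = infix_binding_power(op);
        if l_bp < min_bp {
            break;
        }

        lexer.next();
        let rhs = expr_bp(lexer, r_bp);

        lhs = S::Cons(op, vec![lhs, rhs]);
    }

    lhs
}

#[test]
fn tests() {
    let s = expr("--1 * 2");
    assert_eq!(s.to_string(), "(* (- (- 1)) 2)");

    let s = expr("--f . g");
    assert_eq!(s.to_string(), "(- (- (. f g)))");
}
```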
Amusingly, this purely mechanical, type-driven transformation works. You can also reason why it works, of course. The same argument applies; after we’ve consumed a prefix operator, the operand consists of operators that bind tighter, and we just so conveniently happen to have a function which can parse expressions tighter than the specified power.
Ok, this is getting stupid. If using `((), u8)` “just worked” for prefix operators, can `(u8, ())` deal with postfix ones? Well, let’s add `!` for factorials. It should bind tighter than `-`, because `-(92!)` is obviously more useful than `(-92)!`.
So, the familiar drill — new priority function, shifting priority of `.` (this bit is annoying in Pratt parsers), copy-pasting the code…
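The new tables might look like this (a sketch):

```rust
fn postfix_binding_power(op: char) -> (u8, ()) {
    match op {
        '!' => (7, ()),
        _ => panic!("bad op: {:?}", op),
    }
}

fn infix_binding_power(op: char) -> (u8, u8) {
    match op {
        '+' | '-' => (1, 2),
        '*' | '/' => (3, 4),
        '.' => (10, 9), // shifted by two once more, to make room for '!'
        _ => panic!("bad op: {:?}", op),
    }
}
```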
Wait, something’s wrong here.
After we’ve parsed the prefix expression, we can see either a postfix or an infix operator.
But we bail on unrecognized operators, which is not going to work…
So, let’s make `postfix_binding_power` return an option, for the case where the operator is not postfix:
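A sketch of the adjusted function and loop, with a couple of tests:

```rust
fn postfix_binding_power(op: char) -> Option<(u8, ())> {
    let res = match op {
        '!' => (7, ()),
        _ => return None,
    };
    Some(res)
}

fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
    let mut lhs = match lexer.next() {
        Token::Atom(it) => S::Atom(it),
        Token::Op(op) => {
            let ((), r_bp) = prefix_binding_power(op);
            let rhs = expr_bp(lexer, r_bp);
            S::Cons(op, vec![rhs])
        }
        t => panic!("bad token: {:?}", t),
    };

    loop {
        let op = match lexer.peek() {
            Token::Eof => break,
            Token::Op(op) => op,
            t => panic!("bad token: {:?}", t),
        };

        // Try postfix first; if the operator isn't postfix,
        // fall through to the infix handling.
        if let Some((l_bp, ())) = postfix_binding_power(op) {
            if l_bp < min_bp {
                break;
            }
            lexer.next();

            lhs = S::Cons(op, vec![lhs]);
            continue;
        }

        let (l_bp, r_bp) = infix_binding_power(op);
        if l_bp < min_bp {
            break;
        }

        lexer.next();
        let rhs = expr_bp(lexer, r_bp);

        lhs = S::Cons(op, vec![lhs, rhs]);
    }

    lhs
}

#[test]
fn tests() {
    let s = expr("-9!");
    assert_eq!(s.to_string(), "(- (! 9))");

    let s = expr("f . g !");
    assert_eq!(s.to_string(), "(! (. f g))");
}
```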
Amusingly, both the old and the new tests pass.
Now, we are ready to add a new kind of expression: parenthesised expressions. It is actually not that hard, and we could have done it from the start, but it makes sense to handle this here, you’ll see in a moment why. Parens are just a primary expression, and are handled similarly to atoms:
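The start of `expr_bp` grows one more case; the rest of the function is unchanged (a sketch):

```rust
    let mut lhs = match lexer.next() {
        Token::Atom(it) => S::Atom(it),
        Token::Op('(') => {
            // Inside parens we start from scratch, with zero binding power.
            let lhs = expr_bp(lexer, 0);
            assert_eq!(lexer.next(), Token::Op(')'));
            lhs
        }
        Token::Op(op) => {
            let ((), r_bp) = prefix_binding_power(op);
            let rhs = expr_bp(lexer, r_bp);
            S::Cons(op, vec![rhs])
        }
        t => panic!("bad token: {:?}", t),
    };
```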
Unfortunately, the following test fails:
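For instance:

```rust
#[test]
fn tests() {
    let s = expr("(((0)))");
    assert_eq!(s.to_string(), "0");
}
```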
The panic comes from the loop — the only termination condition we have is reaching eof, and `)` is definitely not eof.
The easiest way to fix that is to change `infix_binding_power` to return `None` on unrecognized operators. That way, it’ll become similar to `postfix_binding_power` again!
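A sketch:

```rust
fn infix_binding_power(op: char) -> Option<(u8, u8)> {
    let res = match op {
        '+' | '-' => (1, 2),
        '*' | '/' => (3, 4),
        '.' => (10, 9),
        _ => return None,
    };
    Some(res)
}
```

And the infix arm of the loop now falls through to a final `break` when the operator is not recognized:

```rust
        if let Some((l_bp, r_bp)) = infix_binding_power(op) {
            if l_bp < min_bp {
                break;
            }

            lexer.next();
            let rhs = expr_bp(lexer, r_bp);

            lhs = S::Cons(op, vec![lhs, rhs]);
            continue;
        }

        break; // an unrecognized operator, such as ')': stop here
```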
And now let’s add the array indexing operator: `a[i]`. What kind of -fix is it? Around-fix? If it were just `a[]`, it would clearly be postfix. If it were just `[i]`, it would work exactly like parens. And that is the key: the `i` part doesn’t really participate in the whole power game, as it is unambiguously delimited. So, let’s do this:
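A sketch, with the note below marked:

```rust
fn postfix_binding_power(op: char) -> Option<(u8, ())> {
    let res = match op {
        '!' | '[' => (7, ()), // (1)
        _ => return None,
    };
    Some(res)
}
```

And the postfix arm of the loop learns to parse the bracketed index:

```rust
        if let Some((l_bp, ())) = postfix_binding_power(op) {
            if l_bp < min_bp {
                break;
            }
            lexer.next();

            lhs = if op == '[' {
                // The index expression is bracketed, so it restarts
                // from zero binding power, just like parens.
                let rhs = expr_bp(lexer, 0);
                assert_eq!(lexer.next(), Token::Op(']'));
                S::Cons(op, vec![lhs, rhs])
            } else {
                S::Cons(op, vec![lhs])
            };
            continue;
        }
```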
- (1) Note that we use the same priority for `!` as for `[`. In general, for the correctness of our algorithm it’s pretty important that, when we make decisions, priorities are never equal. Otherwise, we might end up in a situation like the one before the tiny adjustment for associativity, where there were two equally-good candidates for reduction. However, we only compare right bp’s with left bp’s! So for two postfix operators it’s OK to have the same priority, as both of their priorities are left ones and are never compared with each other.
Finally, the ultimate boss of all operators, the dreaded ternary: `a ? b : c`.
Is this … an all-over-the-place-fix operator? Well, let’s look at the syntax of the ternary from a slightly different angle.
And let’s recall that `a[i]` turned out to be a postfix operator + parenthesis… So, yeah, `?` and `:` are actually a weird pair of parens!
And let’s handle it as such!
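In the loop, the infix arm grows a special case for `?` (a sketch; the concrete binding powers come next):

```rust
        if let Some((l_bp, r_bp)) = infix_binding_power(op) {
            if l_bp < min_bp {
                break;
            }
            lexer.next();

            lhs = if op == '?' {
                // The middle operand is delimited by '?' and ':',
                // so it is parsed with zero binding power, like parens.
                let mhs = expr_bp(lexer, 0);
                assert_eq!(lexer.next(), Token::Op(':'));
                let rhs = expr_bp(lexer, r_bp);
                S::Cons(op, vec![lhs, mhs, rhs])
            } else {
                let rhs = expr_bp(lexer, r_bp);
                S::Cons(op, vec![lhs, rhs])
            };
            continue;
        }
```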
Now, what about priority and associativity? What does associativity even mean in this case? To figure it out, we just squash the parens part and look at a chain: `a ? b : c ? d : e`. This can be parsed as `(a ? b : c) ? d : e` or as `a ? b : (c ? d : e)`. What is more useful? For `?`-chains like this, which read as a cascade of if-then-elses, the right-associative reading is more useful.
Priority-wise, the ternary is low priority. In C, only `=` and `,` have lower priority. While we are at it, let’s add C-style right-associative `=` as well.
Here’s the most complete and perfect version of our simple Pratt parser:
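A complete sketch along these lines (the `Lexer` and `S` types are as defined above; all the earlier priorities are shifted up by four to make room for `?` and `=` at the bottom):

```rust
fn expr(input: &str) -> S {
    let mut lexer = Lexer::new(input);
    expr_bp(&mut lexer, 0)
}

fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
    let mut lhs = match lexer.next() {
        Token::Atom(it) => S::Atom(it),
        Token::Op('(') => {
            let lhs = expr_bp(lexer, 0);
            assert_eq!(lexer.next(), Token::Op(')'));
            lhs
        }
        Token::Op(op) => {
            let ((), r_bp) = prefix_binding_power(op);
            let rhs = expr_bp(lexer, r_bp);
            S::Cons(op, vec![rhs])
        }
        t => panic!("bad token: {:?}", t),
    };

    loop {
        let op = match lexer.peek() {
            Token::Eof => break,
            Token::Op(op) => op,
            t => panic!("bad token: {:?}", t),
        };

        if let Some((l_bp, ())) = postfix_binding_power(op) {
            if l_bp < min_bp {
                break;
            }
            lexer.next();

            lhs = if op == '[' {
                let rhs = expr_bp(lexer, 0);
                assert_eq!(lexer.next(), Token::Op(']'));
                S::Cons(op, vec![lhs, rhs])
            } else {
                S::Cons(op, vec![lhs])
            };
            continue;
        }

        if let Some((l_bp, r_bp)) = infix_binding_power(op) {
            if l_bp < min_bp {
                break;
            }
            lexer.next();

            lhs = if op == '?' {
                let mhs = expr_bp(lexer, 0);
                assert_eq!(lexer.next(), Token::Op(':'));
                let rhs = expr_bp(lexer, r_bp);
                S::Cons(op, vec![lhs, mhs, rhs])
            } else {
                let rhs = expr_bp(lexer, r_bp);
                S::Cons(op, vec![lhs, rhs])
            };
            continue;
        }

        break;
    }

    lhs
}

fn prefix_binding_power(op: char) -> ((), u8) {
    match op {
        '+' | '-' => ((), 9),
        _ => panic!("bad op: {:?}", op),
    }
}

fn postfix_binding_power(op: char) -> Option<(u8, ())> {
    let res = match op {
        '!' | '[' => (11, ()),
        _ => return None,
    };
    Some(res)
}

fn infix_binding_power(op: char) -> Option<(u8, u8)> {
    let res = match op {
        '=' => (2, 1),
        '?' => (4, 3),
        '+' | '-' => (5, 6),
        '*' | '/' => (7, 8),
        '.' => (14, 13),
        _ => return None,
    };
    Some(res)
}
```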
The code is also available in this repository. Eof :-)