Resilient LL Parsing Tutorial
In this tutorial, I will explain a particular approach to parsing which gracefully handles syntax errors and is thus suitable for language servers, which, by their nature, have to handle incomplete and invalid code. Explaining the problem and the solution requires a somewhat non-trivial worked example, and I want to share a couple of tricks not directly related to resilience, so the tutorial builds a full, self-contained parser, instead of explaining just the resilience in the abstract.
The tutorial is descriptive, rather than prescriptive — it tells you what you can do, not what you should do.
- If you are looking into building a production grade language server, treat it as a library of ideas, not as a blueprint.
- If you want to get something working quickly, I think today the best answer is “just use Tree-sitter”, so you’d better read its docs rather than this tutorial.
- If you are building an IDE-grade parser from scratch, then techniques presented here might be directly applicable.
Why Is Resilience Needed?
Let’s look at one motivational example for resilient parsing:
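```rust
// The user is mid-edit: `fib_rec` is unfinished, while `fib` below is complete.
fn fib_rec(f1: u32,

fn fib(n: u32) -> u32 {
    return fib_rec(1, 1, n);
}
```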
Here, a user is in the process of defining the `fib_rec` helper function. For a language server, it's important that the incompleteness doesn't get in the way. In particular:

- The following function, `fib`, should be parsed without any errors, such that syntax and semantic highlighting is not disturbed, and all calls to `fib` elsewhere typecheck correctly.
- The `fib_rec` function itself should be recognized as a partially complete function, so that various language server assists can help complete it correctly.
- In particular, a smart language server can actually infer the expected type of `fib_rec` from the call we already have, and suggest completing the whole prototype. rust-analyzer doesn't do that today, but one day it should.
Generalizing this example, what we want from our parser is to recognize as much of the syntactic structure as feasible. It should be able to localize errors — a mistake in a function generally should not interfere with parsing unrelated functions. As the code is read and written left-to-right, the parser should also recognize valid partial prefixes of various syntactic constructs.
Academic literature suggests another lens to use when looking at this problem: error recovery.
Rather than just recognizing incomplete constructs, the parser can attempt to guess a minimal edit which completes the construct and gets rid of the syntax error.
From this angle, the above example would look rather like `fn fib_rec(f1: u32, /* ) {} */`, where the stuff in the comment is automatically inserted by the parser.
Resilience is a more fruitful framing to use for a language server — incomplete code is the ground truth, and only the user knows how to correctly complete it. A language server can only offer guesses and suggestions, and they are more precise if they employ post-parsing semantic information.
Error recovery might work better when emitting understandable syntax errors, but, in a language server, the importance of clear error messages for syntax errors is relatively lower, as highlighting such errors right in the editor, synchronously with typing, usually provides tighter, more useful feedback.
Approaches to Error Resilience
The classic approach for handling parser errors is to explicitly encode error productions and synchronization tokens into the language grammar. This approach isn’t a natural fit for resilience framing — you don’t want to anticipate every possible error, as there are just too many possibilities. Rather, you want to recover as much of a valid syntax tree as possible, and more or less ignore arbitrary invalid parts.
Tree-sitter does something more interesting.
It is a GLR parser, meaning that it non-deterministically tries many possible LR (bottom-up) parses, and looks for the best one.
This allows Tree-sitter to recognize many complete valid small fragments of a tree, but it might have trouble assembling them into incomplete larger fragments.
In our example `fn fib_rec(f1: u32,`, Tree-sitter correctly recognizes `f1: u32` as a formal parameter, but doesn't recognize `fib_rec` as a function.
The top-down (LL) parsing paradigm makes it harder to recognize valid small fragments, but naturally allows for incomplete large nodes. Because code is written top-down and left-to-right, LL seems to have an advantage for typical patterns of incomplete code. Moreover, there isn't really anything special you need to do to make LL parsing resilient. You sort of… just don't crash on the first error, and everything else more or less just works.
Details are fiddly though, so, in the rest of the post, we will write a complete implementation of a hand-written recursive descent + Pratt resilient parser.
Introducing L
For lack of imagination on my side, the toy language we will be parsing is called L.
It is a subset of Rust, which has just enough features to make some syntax mistakes.
Here’s Fibonacci:
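```rust
fn fib(n: u32) -> u32 {
    return fib_rec(1, 1, n);
}

fn fib_rec(f1: u32, f2: u32, n: u32) -> u32 {
    return fib_rec(f2, f1 + f2, n - 1);
}
```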
Note that there's no base case, because L doesn't have syntax for `if`.
Here’s the syntax it does have, as an ungrammar:
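```
File = Fn*

Fn = 'fn' 'name' ParamList ('->' TypeExpr)? Block

ParamList = '(' Param* ')'
Param = 'name' ':' TypeExpr ','?

TypeExpr = 'name'

Block = '{' Stmt* '}'

Stmt =
  StmtLet
| StmtReturn
| StmtExpr

StmtLet = 'let' 'name' '=' Expr ';'
StmtReturn = 'return' Expr ';'
StmtExpr = Expr ';'

Expr =
  ExprLiteral
| ExprName
| ExprParen
| ExprBinary
| ExprCall

ExprLiteral = 'int' | 'true' | 'false'
ExprName = 'name'
ExprParen = '(' Expr ')'
ExprBinary = Expr ('+' | '-' | '*' | '/') Expr
ExprCall = Expr ArgList

ArgList = '(' Arg* ')'
Arg = Expr ','?
```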
The meta syntax here is similar to BNF, with two important differences:
- the notation is better specified and more familiar (recursive regular expressions),
- it describes syntax trees, rather than strings (sequences of tokens).
Single quotes signify terminals: `'fn'` and `'return'` are keywords, `'name'` stands for any identifier token, like `foo`, and `'('` is punctuation.
Unquoted names are non-terminals. For example, `x: i32,` would be an example of `Param`.
Unquoted punctuation marks are meta symbols of ungrammar itself, with semantics identical to regular expressions: `*` is zero-or-more repetition, `?` is zero-or-one, `|` is alternation, and `()` is used for grouping.
The grammar doesn't nail down the syntax precisely. For example, the rule for `Param`, `Param = 'name' ':' TypeExpr ','?`, says that a `Param` syntax node has an optional comma, but there's nothing in the above ungrammar specifying whether trailing commas are allowed.
Overall, L has very little to it — a program is a series of function declarations, each function has a body which is a sequence of statements, and the set of expressions is spartan, without even an `if`. Still, it'll take us some time to parse all that.
But you can already try the end result in the text-box below.
The syntax tree is updated automatically on typing.
Do make mistakes to see how a partial tree is recovered.
Designing the Tree
A traditional AST for L might look roughly like this:
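```rust
// A sketch; names are illustrative.
struct File {
    functions: Vec<Function>,
}

struct Function {
    name: String,
    params: Vec<Param>,
    return_type: Option<TypeExpr>,
    body: Block,
}

struct Param {
    name: String,
    ty: TypeExpr,
}

struct TypeExpr {
    name: String,
}

struct Block {
    statements: Vec<Statement>,
}

enum Statement {
    Let { name: String, value: Expression },
    Return(Expression),
    Expression(Expression),
}

enum Expression {
    Literal(String),
    Name(String),
    Paren(Box<Expression>),
    Binary(Box<Expression>, BinaryOp, Box<Expression>),
    Call(Box<Expression>, Vec<Expression>),
}

enum BinaryOp {
    Add,
    Sub,
    Mul,
    Div,
}
```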
Extending this structure to be resilient is non-trivial. There are two problems: trivia and errors.
For resilient parsing, we want the AST to contain every detail about the source text. We actually don’t want to use an abstract syntax tree, and need a concrete one. In a traditional AST, the tree structure is rigidly defined — any syntax node has a fixed number of children. But there can be any number of comments and whitespace anywhere in the tree, and making space for them in the structure requires some fiddly data manipulation. Similarly, errors (e.g., unexpected tokens), can appear anywhere in the tree.
One trick to handle these in the AST paradigm is to attach trivia and error tokens to other tokens. That is, for something like `fn /* name of the function -> */ f() {}`, the `fn` and `f` tokens would be explicit parts of the AST, while the comment and surrounding whitespace would belong to the collection of trivia tokens hanging off the `fn` token.
One complication here is that it's not always just tokens that can appear anywhere; sometimes you can have full trees like that. For example, comments might support markdown syntax, and you might actually want to parse that properly (e.g., to resolve links to declarations). Syntax errors can also span whole subtrees. For example, when parsing `pub(crate) nope` in Rust, it would be smart to parse `pub(crate)` as a visibility modifier, and nest it into a bigger `Error` node.
SwiftSyntax meticulously adds error placeholders between any two fields of an AST node, giving rise to `unexpectedBetweenModifiersAndDeinitKeyword` and such (source, docs).
An alternative approach, used by IntelliJ and rust-analyzer, is to treat the syntax tree as a somewhat dynamically-typed data structure:
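```rust
#[derive(Debug)]
struct Tree {
    kind: TreeKind,
    children: Vec<Child>,
}

#[derive(Debug)]
enum Child {
    Token(Token),
    Tree(Tree),
}
```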
This structure does not enforce any constraints on the shape of the syntax tree at all, and so it naturally accommodates errors anywhere. It is possible to layer a well-typed API on top of this dynamic foundation. An extra benefit of this representation is that you can use the same tree type for different languages; this is a requirement for universal tools.
Discussing specifics of syntax tree representation goes beyond this article, as the topic is vast and lacks a clear winning solution. To learn about it, take a look at Roslyn, SwiftSyntax, rowan and IntelliJ.
To simplify things, we’ll ignore comments and whitespace, though you’ll absolutely want those in a real implementation. One approach would be to do the parsing without comments, like we do here, and then attach comments to the nodes in a separate pass. Attaching comments needs some heuristics — for example, non-doc comments generally want to be a part of the following syntax node.
Another design choice is the handling of error messages. One approach is to treat error messages as properties of the syntax tree itself, by either inferring them from the tree structure, or just storing them inline. Alternatively, errors can be considered a side effect of the parsing process (that way, trees constructed manually during, e.g., refactors won't carry any error messages, even if they are invalid).
Here’s the full set of token and tree kinds for our language L:
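```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    ErrorToken, Eof,

    LParen, RParen, LCurly, RCurly,
    Eq, Semi, Comma, Colon, Arrow,
    Plus, Minus, Star, Slash,

    FnKeyword, LetKeyword, ReturnKeyword,
    TrueKeyword, FalseKeyword,

    Name, Int,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TreeKind {
    ErrorTree,
    File, Fn,
    TypeExpr,
    ParamList, Param,
    Block,
    StmtLet, StmtReturn, StmtExpr,
    ExprLiteral, ExprName, ExprParen,
    ExprBinary, ExprCall,
    ArgList, Arg,
}

#[derive(Debug)]
struct Token {
    kind: TokenKind,
    text: String,
}
```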
Things to note:
- explicit `Error` kinds;
- no whitespace or comments, as an unrealistic simplification;
- an `Eof` virtual token simplifies parsing, removing the need to handle `Option<Token>`;
- punctuators are named after what they are, rather than after what they usually mean: `Star`, rather than `Mult`;
- a good set of names for the various kinds of brackets is `{L,R}{Paren,Curly,Brack,Angle}`.
Lexer
We won't be covering the lexer here; let's just say we have a `fn lex(text: &str) -> Vec<Token>` function. Two points worth mentioning:
- The lexer itself should be resilient, but that's easy — produce an `Error` token for anything which isn't a valid token.
- Writing a lexer by hand is somewhat tedious, but very simple relative to everything else. If you are stuck in analysis paralysis picking a lexer generator, consider cutting the Gordian knot and hand-writing one.
Parser
With homogeneous syntax trees, the task of parsing admits an elegant formalization — we want to insert extra parentheses into a stream of tokens.
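For example, for the empty function `fn f() {}`, the token stream and the stream with extra parentheses might look like this (an illustrative sketch):

```
tokens:        'fn' 'f' '(' ')' '{' '}'
with parens:   (File (Fn 'fn' 'f' (ParamList '(' ')') (Block '{' '}')))
```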
Note how the sequence of tokens with extra parentheses is still a flat sequence. The parsing will be two-phase:
- in the first phase, the parser emits a flat list of events,
- in the second phase, the list is converted to a tree.
Here’s the basic setup for the parser:
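```rust
use std::cell::Cell;

// A sketch of the parser plumbing; error reporting is just eprintln! for brevity.
struct Parser {
    tokens: Vec<Token>,
    pos: usize,
    fuel: Cell<u32>,
    events: Vec<Event>,
}

#[derive(Debug)]
enum Event {
    Open { kind: TreeKind },
    Close,
    Advance,
}

struct MarkOpened {
    index: usize,
}

impl Parser {
    fn open(&mut self) -> MarkOpened {
        let mark = MarkOpened { index: self.events.len() };
        // Placeholder kind; patched later by `close`.
        self.events.push(Event::Open { kind: TreeKind::ErrorTree });
        mark
    }

    fn close(&mut self, m: MarkOpened, kind: TreeKind) {
        self.events[m.index] = Event::Open { kind };
        self.events.push(Event::Close);
    }

    fn advance(&mut self) {
        assert!(!self.eof());
        self.fuel.set(256);
        self.events.push(Event::Advance);
        self.pos += 1;
    }

    fn eof(&self) -> bool {
        self.pos == self.tokens.len()
    }

    fn nth(&self, lookahead: usize) -> TokenKind {
        if self.fuel.get() == 0 {
            panic!("parser is stuck")
        }
        self.fuel.set(self.fuel.get() - 1);
        self.tokens
            .get(self.pos + lookahead)
            .map_or(TokenKind::Eof, |it| it.kind)
    }

    fn at(&self, kind: TokenKind) -> bool {
        self.nth(0) == kind
    }

    fn eat(&mut self, kind: TokenKind) -> bool {
        if self.at(kind) {
            self.advance();
            return true;
        }
        false
    }

    fn expect(&mut self, kind: TokenKind) {
        if self.eat(kind) {
            return;
        }
        // TODO: proper error reporting.
        eprintln!("expected {kind:?}");
    }

    fn advance_with_error(&mut self, error: &str) {
        let m = self.open();
        // TODO: proper error reporting.
        eprintln!("{error}");
        self.advance();
        self.close(m, TreeKind::ErrorTree);
    }
}
```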
- `open`, `advance`, and `close` form the basis for constructing the stream of events.
- Note how the `kind` is stored in the `Open` event, but is supplied with the `close` method. This is required for flexibility — sometimes it's possible to decide on the kind of a syntax node only after it is parsed. The way this works is that the `open` method returns a `MarkOpened`, which is subsequently passed to `close` to modify the corresponding `Open` event.
- There's a set of short, convenient methods to navigate through the sequence of tokens:
  - `nth` is the lookahead method. Note how it doesn't return an `Option`, and instead uses the special `Eof` value for out-of-bounds indexes. This simplifies call sites — "no more tokens" and "token of the wrong kind" are always handled the same way.
  - `at` is a convenient specialization to check for a specific next token.
  - `eat` is `at` combined with consuming the next token.
  - `expect` is `eat` combined with error reporting.

  These methods are not a very orthogonal basis, but they are a convenient basis for parsing. Finally, `advance_with_error` advances over any token, but also wraps it into an error node.
- When writing parsers by hand, it's very easy to accidentally write code which loops or recurses forever. To simplify debugging, it's helpful to add an explicit notion of "fuel", which is replenished every time the parser makes progress, and spent every time it does not.
The function to transform a flat list of events into a tree is a bit involved. It juggles three things: an iterator of events, an iterator of tokens, and a stack of partially constructed nodes (we expect the stack to contain just one node at the end).
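Here's a sketch:

```rust
fn build_tree(tokens: Vec<Token>, mut events: Vec<Event>) -> Tree {
    let mut tokens = tokens.into_iter();
    let mut stack = Vec::new();

    // Remove the last `Close` event, so that the stack
    // is never empty inside the loop below.
    assert!(matches!(events.pop(), Some(Event::Close)));

    for event in events {
        match event {
            // Starting a new node: push an empty tree onto the stack.
            Event::Open { kind } => {
                stack.push(Tree { kind, children: Vec::new() });
            }
            // A tree is done: pop it off the stack
            // and append it to its parent.
            Event::Close => {
                let tree = stack.pop().unwrap();
                stack.last_mut().unwrap().children.push(Child::Tree(tree));
            }
            // Consume a token and append it to the current tree.
            Event::Advance => {
                let token = tokens.next().unwrap();
                stack.last_mut().unwrap().children.push(Child::Token(token));
            }
        }
    }

    // The parser guarantees that all trees are closed
    // and that all tokens are consumed.
    let tree = stack.pop().unwrap();
    assert!(stack.is_empty());
    assert!(tokens.next().is_none());
    tree
}
```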
Grammar
We are finally getting to the actual topic of resilient parsing.
Now we will write a full grammar for L as a sequence of functions.
Usually, both atomic parser operations, like `fn advance`, and grammar productions, like `fn parse_fn`, are implemented as methods on the `Parser` struct. I prefer to separate the two and to use free functions for the latter category, as the code is a bit more readable that way.
Let’s start with parsing the top level.
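A sketch (assuming `use TokenKind::*;` and `use TreeKind::*;` are in scope for brevity):

```rust
fn file(p: &mut Parser) {
    let m = p.open();

    while !p.eof() {
        if p.at(FnKeyword) {
            func(p);
        } else {
            p.advance_with_error("expected a function");
        }
    }

    p.close(m, File);
}
```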
- Wrap the whole thing into a `File` node.
- Use a `while` loop to parse a file as a series of functions. Importantly, the entirety of the file is parsed; we break out of the loop only when the eof is reached.
- To not get stuck in this loop, it's crucial that every iteration consumes at least one token. If the token is `fn`, we'll parse at least a part of a function. Otherwise, we consume the token and wrap it into an error node.
Let's parse functions now:
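```rust
fn func(p: &mut Parser) {
    assert!(p.at(FnKeyword));
    let m = p.open();

    p.expect(FnKeyword);
    p.expect(Name);
    if p.at(LParen) {
        param_list(p);
    }
    if p.eat(Arrow) {
        type_expr(p);
    }
    if p.at(LCurly) {
        block(p);
    }

    p.close(m, Fn);
}
```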
- When parsing a function, we assert that the current token is `fn`. There's some duplication with the `if p.at(FnKeyword)` check at the call site, but this duplication actually helps readability.
- Again, we surround the body of the function with an `open`/`close` pair.
- Although the parameter list and function body are mandatory, we precede them with an `at` check. We can still report the syntax error by analyzing the structure of the syntax tree (or we can report it as a side effect of parsing in the `else` branch if we want). It wouldn't be wrong to just remove the `if` altogether and try to parse `param_list` unconditionally, but the `if` helps with reducing cascading errors.
Now, the list of parameters:
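```rust
fn param_list(p: &mut Parser) {
    assert!(p.at(LParen));
    let m = p.open();

    p.expect(LParen);
    while !p.at(RParen) && !p.eof() {
        if p.at(Name) {
            param(p);
        } else {
            break;
        }
    }
    p.expect(RParen);

    p.close(m, ParamList);
}
```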
- Inside, we have a standard code shape for parsing a bracketed list. It can be extracted into a higher-order function, but typing out the code manually is not a problem either. This bit of code starts and ends with consuming the corresponding parenthesis.
- In the happy case, we loop until the closing parenthesis. However, it could also be the case that there's no closing parenthesis at all, so we add an `eof` condition as well. Generally, every loop we write will have `&& !p.eof()` tacked on.
- As with any loop, we need to ensure that each iteration consumes at least one token, to not get stuck. If the current token is an identifier, everything is OK, as we'll parse at least some part of the parameter.
Parsing a parameter is almost nothing new at this point:
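```rust
fn param(p: &mut Parser) {
    assert!(p.at(Name));
    let m = p.open();

    p.expect(Name);
    p.expect(Colon);
    type_expr(p);
    if !p.at(RParen) {
        // The comma is mandatory unless `)` follows.
        p.expect(Comma);
    }

    p.close(m, Param);
}
```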
- This is the only interesting bit. To parse a comma-separated list of parameters with a trailing comma, it's enough to check whether the token following a parameter is `)`. This correctly handles all three cases:
  - if the next token is `)`, we are at the end of the list, and no comma is required;
  - if the next token is `,`, we correctly advance past it;
  - finally, if the next token is anything else, then it's not a `)`, so we are not at the last element of the list, and correctly emit an error.
Parsing types is trivial:
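```rust
fn type_expr(p: &mut Parser) {
    let m = p.open();
    p.expect(Name);
    p.close(m, TypeExpr);
}
```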
The notable aspect here is naming. The production is deliberately named `TypeExpr`, rather than `Type`, to avoid confusion down the line. Consider `fib(92)`. It is an expression, which evaluates to a value. The same thing happens with types. For example, `Foo<Int>` is not a type yet, it's an expression which can be "evaluated" (at compile time) to a type (if `Foo` is a type alias, the result might be something like `Pair<Int, Int>`).
Parsing a block gets a bit more involved:
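```rust
fn block(p: &mut Parser) {
    assert!(p.at(LCurly));
    let m = p.open();

    p.expect(LCurly);
    while !p.at(RCurly) && !p.eof() {
        match p.nth(0) {
            LetKeyword => stmt_let(p),
            ReturnKeyword => stmt_return(p),
            _ => stmt_expr(p),
        }
    }
    p.expect(RCurly);

    p.close(m, Block);
}
```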
A block can contain many different kinds of statements, so we branch on the first token in the loop's body. As usual, we need to maintain the invariant that the loop body consumes at least one token. For `let` and `return` statements that's easy — they consume the fixed first token. For the expression statement (things like `1 + 1;`) it gets more interesting, as an expression can start with many different tokens. For the time being, we'll just kick the can down the road and require `stmt_expr` to deal with it (that is, to guarantee that at least one token is consumed).
Statements themselves are straightforward:
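```rust
fn stmt_let(p: &mut Parser) {
    assert!(p.at(LetKeyword));
    let m = p.open();

    p.expect(LetKeyword);
    p.expect(Name);
    p.expect(Eq);
    expr(p);
    p.expect(Semi);

    p.close(m, StmtLet);
}

fn stmt_return(p: &mut Parser) {
    assert!(p.at(ReturnKeyword));
    let m = p.open();

    p.expect(ReturnKeyword);
    expr(p);
    p.expect(Semi);

    p.close(m, StmtReturn);
}

fn stmt_expr(p: &mut Parser) {
    let m = p.open();

    expr(p);
    p.expect(Semi);

    p.close(m, StmtExpr);
}
```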
Again, for `stmt_expr`, we push the "must consume a token" invariant onto `expr`.
Expressions are tricky. They always are. For starters, let's handle just the clearly-delimited cases, like literals and parentheses:
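```rust
fn expr(p: &mut Parser) {
    expr_delimited(p);
}

fn expr_delimited(p: &mut Parser) {
    let m = p.open();
    match p.nth(0) {
        Int | TrueKeyword | FalseKeyword => {
            p.advance();
            p.close(m, ExprLiteral);
        }
        Name => {
            p.advance();
            p.close(m, ExprName);
        }
        LParen => {
            p.expect(LParen);
            expr(p);
            p.expect(RParen);
            p.close(m, ExprParen);
        }
        _ => {
            // Catch-all: consume the token so the statement loop in
            // `block` can always make progress.
            if !p.eof() {
                p.advance();
            }
            p.close(m, ErrorTree);
        }
    }
}
```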
In the catch-all arm, we take care to consume the token, to make sure that the statement loop in `block` can always make progress.
The next expression to handle is `ExprCall`. This requires some preparation. Consider this example: `f(1)(2)`. We want the following parenthesis structure here:
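```
(ExprCall
   (ExprCall
      (ExprName 'f')
      (ArgList '(' (Arg '1') ')'))
   (ArgList '(' (Arg '2') ')'))
```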
The problem is, when the parser is at `f`, it doesn't yet know how many `Open` events it should emit. We solve the problem by adding an API to go back and inject a new `Open` event into the middle of existing events.
- Here we adjust `close` to also return a `MarkClosed`, such that we can go back and add a new event before it.
- The new API, `open_before`, is like `open`, but it also takes a `MarkClosed`, which carries the index of an `Open` event in front of which we are to inject a new `Open`. In the current implementation, for simplicity, we just insert into the middle of the vector, which is an O(N) operation in the worst case. A proper solution here would be to use an index-based linked list: `open_before` can push the new open event to the end of the list, and also mark the old event with a pointer to the freshly inserted one. Storing the pointer needs an extra field on the `Open` event, and the loop in `build_tree` then needs to follow the `open_before` links.
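Here's a sketch of the simple O(N) version (the linked-list refinement is left out):

```rust
struct MarkClosed {
    index: usize,
}

impl Parser {
    fn close(&mut self, m: MarkOpened, kind: TreeKind) -> MarkClosed {
        self.events[m.index] = Event::Open { kind };
        self.events.push(Event::Close);
        MarkClosed { index: m.index }
    }

    fn open_before(&mut self, m: MarkClosed) -> MarkOpened {
        let mark = MarkOpened { index: m.index };
        // O(N) in the worst case; the linked-list refinement avoids this.
        self.events.insert(
            m.index,
            Event::Open { kind: TreeKind::ErrorTree },
        );
        mark
    }
}
```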
With this new API, we can parse function calls:
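```rust
fn expr(p: &mut Parser) {
    let mut lhs = expr_delimited(p);

    while p.at(LParen) {
        let m = p.open_before(lhs);
        arg_list(p);
        lhs = p.close(m, ExprCall);
    }
}

fn arg_list(p: &mut Parser) {
    assert!(p.at(LParen));
    let m = p.open();

    p.expect(LParen);
    while !p.at(RParen) && !p.eof() {
        arg(p);
    }
    p.expect(RParen);

    p.close(m, ArgList);
}

fn arg(p: &mut Parser) {
    let m = p.open();

    expr(p);
    if !p.at(RParen) {
        p.expect(Comma);
    }

    p.close(m, Arg);
}
```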
- `expr_delimited` now returns a `MarkClosed` rather than `()`. No code changes are required for this, as the `close` calls are already in tail position.
- To parse function calls, we check whether we are at `(`, and use the `open_before` API if that is the case.
- Parsing an argument list should be routine by now. Again, as an expression can start with many different tokens, we don't add an `if p.at` check to the loop's body, and instead require `arg` to consume at least one token.
- Inside `arg`, we use the already familiar construct to parse an optional trailing comma.
Now only binary expressions are left. We will use a Pratt parser for those. This is genuinely tricky code, so I have a dedicated article explaining how it all works:
Simple but Powerful Pratt Parsing.
Here, I’ll just dump a pageful of code without much explanation:
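```rust
fn expr(p: &mut Parser) {
    expr_rec(p, Eof);
}

fn expr_rec(p: &mut Parser, left: TokenKind) {
    let mut lhs = expr_delimited(p);

    while p.at(LParen) {
        let m = p.open_before(lhs);
        arg_list(p);
        lhs = p.close(m, ExprCall);
    }

    loop {
        let right = p.nth(0);
        if right_binds_tighter(left, right) {
            let m = p.open_before(lhs);
            p.advance();
            expr_rec(p, right);
            lhs = p.close(m, ExprBinary);
        } else {
            break;
        }
    }
}

fn right_binds_tighter(left: TokenKind, right: TokenKind) -> bool {
    fn tightness(kind: TokenKind) -> Option<usize> {
        [
            // Precedence table, from loosest to tightest.
            [Plus, Minus].as_slice(),
            &[Star, Slash],
        ]
        .iter()
        .position(|level| level.contains(&kind))
    }
    let Some(right_tightness) = tightness(right) else {
        return false;
    };
    let Some(left_tightness) = tightness(left) else {
        assert_eq!(left, Eof);
        return true;
    };
    right_tightness > left_tightness
}
```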
- In this version of Pratt parsing, rather than passing a numerical precedence, I pass the actual token (I learned that from jamii's post). So, to determine whether to break or recur in the Pratt loop, we ask which of the two tokens binds tighter, and act accordingly.
- When we start parsing an expression, we don't have an operator to the left yet, so I just pass `Eof` as a dummy token.
- The code naturally handles the case when the next token is not an operator (that is, when the expression is complete, or when there's some syntax error).
And that’s it! We have parsed the entirety of L!
Basic Resilience
Let's see how resilient our basic parser is by checking it against the motivational example from the start (the incomplete `fib_rec` followed by a complete `fib`). The syntax tree our parser produces is, surprisingly, exactly what we want: for the first, incomplete function, we get `Fn`, `Param`, and `ParamList` nodes, as we should, and the second function is parsed without any errors.
Curiously, we get this great result without much explicit effort to make parsing resilient; it's a natural outcome of just not failing in the presence of errors. The following ingredients help us:
- homogeneous syntax tree supports arbitrary malformed code,
- any syntactic construct is parsed left-to-right, and valid prefixes are always recognized,
- our top-level loop in `file` is greedy: it either parses a function, or skips a single token and tries to parse a function again. That way, if there's a valid function somewhere, it will be recognized.
Thinking about the last case both reveals the limitations of our current code, and shows avenues for improvement. In general, parsing works as a series of nested loops:
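```
file:       loop { func }
func:       param_list ('->' type_expr)? block
param_list: loop { param }
block:      loop { stmt }
arg_list:   loop { arg }
```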
If something goes wrong inside a loop, our choices are:
- skip a token, and continue with the next iteration of the current loop,
- break out of the inner loop, and let the outer loop handle recovery.
The top-most loop must use the “skip a token” solution, because it needs to consume all of the input tokens.
Improving Resilience
Right now, each loop either always skips, or always breaks. This is not optimal. Consider this example:
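```rust
// Hypothetical input: f1 is missing `)`, f2 has a duplicated comma.
fn f1(x: i32

fn f2(x: i32,, z: i32) {}
```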
Here, for `f1` we want to break out of the `param_list` loop, and our code does just that. For `f2` though, the error is a duplicated comma (the user will add a new parameter between `x` and `z` shortly), so we want to skip here. We don't, and, as a result, the syntax tree for `f2` is a train wreck.
For parameters, it is reasonable to skip tokens until we see something which implies the end of the parameter list. For example, if we are parsing a list of parameters and see an `fn` token, then we'd better stop. If we see some less salient token, it's better to gobble it up. Let's implement the idea:
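```rust
// The recovery set: tokens which signal "the parameter list is over".
// `{` is in the follow set of ParamList; `fn` starts a new function.
const PARAM_LIST_RECOVERY: &[TokenKind] = &[FnKeyword, LCurly];

fn param_list(p: &mut Parser) {
    assert!(p.at(LParen));
    let m = p.open();

    p.expect(LParen);
    while !p.at(RParen) && !p.eof() {
        if p.at(Name) {
            param(p);
        } else if p.at_any(PARAM_LIST_RECOVERY) {
            break;
        } else {
            p.advance_with_error("expected parameter");
        }
    }
    p.expect(RParen);

    p.close(m, ParamList);
}

impl Parser {
    fn at_any(&self, kinds: &[TokenKind]) -> bool {
        kinds.contains(&self.nth(0))
    }
}
```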
Here, we use the `at_any` helper function, which is like `at`, but takes a list of tokens. A real implementation would use bitsets for this purpose. The example now parses correctly.
What is a reasonable `RECOVERY` set in the general case? I don't know the answer to this question, but follow sets from formal grammar theory give a good intuition. We don't want exactly the follow set: for `ParamList`, `{` is in the follow set, and we do want it to be a part of the recovery set; but `fn` is not in the follow set, and yet it is important to recover on it. `fn` is included because it's in the follow set for `Fn`, and `ParamList` is a child of `Fn`: we also want to recursively include the follow sets of ancestors into the recovery set.
For expressions and statements, we have the opposite problem — the `block` and `arg_list` loops eagerly consume erroneous tokens, but sometimes it would be wise to break out of the loop instead. Consider this example:
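```rust
// A hypothetical snippet: an unclosed call, followed by a new function.
fn f() {
    g(1,

fn g() {}
```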
It gives another train-wreck syntax tree, where the `g` function is completely missed.
Recall that the root cause here is that we require `expr` to consume at least one token, because it's not immediately obvious which tokens can start an expression. It's not immediately obvious, but it is easy to compute — that's exactly the first set from formal grammar theory. Using it, we get:
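```rust
// First set for expressions: the tokens an expression can start with.
const EXPR_FIRST: &[TokenKind] = &[Int, TrueKeyword, FalseKeyword, Name, LParen];

// Tokens on which the statement loop gives up and lets the caller
// recover (`fn` here is a judgment call, like the recovery set above).
const STMT_RECOVERY: &[TokenKind] = &[FnKeyword];

fn block(p: &mut Parser) {
    assert!(p.at(LCurly));
    let m = p.open();

    p.expect(LCurly);
    while !p.at(RCurly) && !p.eof() {
        match p.nth(0) {
            LetKeyword => stmt_let(p),
            ReturnKeyword => stmt_return(p),
            _ => {
                if p.at_any(EXPR_FIRST) {
                    stmt_expr(p);
                } else if p.at_any(STMT_RECOVERY) {
                    break;
                } else {
                    p.advance_with_error("expected statement");
                }
            }
        }
    }
    p.expect(RCurly);

    p.close(m, Block);
}

fn arg_list(p: &mut Parser) {
    assert!(p.at(LParen));
    let m = p.open();

    p.expect(LParen);
    while !p.at(RParen) && !p.eof() {
        if p.at_any(EXPR_FIRST) {
            arg(p);
        } else {
            break;
        }
    }
    p.expect(RParen);

    p.close(m, ArgList);
}
```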
This fixes the syntax tree: `g` is now recognized as a function.
There's only one issue left. Our `expr` parsing is still greedy, so, in a case like this, the `let` will be consumed as a right-hand-side operand of `+`:
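```rust
// A hypothetical snippet: the binary expression is missing its right-hand side.
fn f() {
    let x = 1 +
    let y = 2;
}
```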
Now that the callers of `expr` contain a check for `EXPR_FIRST`, we no longer need this greediness, and can return `None` if no expression can be parsed:
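```rust
fn expr(p: &mut Parser) {
    expr_rec(p, Eof);
}

fn expr_rec(p: &mut Parser, left: TokenKind) {
    let Some(mut lhs) = expr_delimited(p) else {
        return;
    };

    while p.at(LParen) {
        let m = p.open_before(lhs);
        arg_list(p);
        lhs = p.close(m, ExprCall);
    }

    loop {
        let right = p.nth(0);
        if right_binds_tighter(left, right) {
            let m = p.open_before(lhs);
            p.advance();
            expr_rec(p, right);
            lhs = p.close(m, ExprBinary);
        } else {
            break;
        }
    }
}

fn expr_delimited(p: &mut Parser) -> Option<MarkClosed> {
    let result = match p.nth(0) {
        Int | TrueKeyword | FalseKeyword => {
            let m = p.open();
            p.advance();
            p.close(m, ExprLiteral)
        }
        Name => {
            let m = p.open();
            p.advance();
            p.close(m, ExprName)
        }
        LParen => {
            let m = p.open();
            p.expect(LParen);
            expr(p);
            p.expect(RParen);
            p.close(m, ExprParen)
        }
        _ => return None,
    };
    Some(result)
}
```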
This gives the syntax tree we want: the second `let` statement is no longer swallowed by the `+`.
And this concludes the tutorial! You are now capable of implementing an IDE-grade parser for a real programming language from scratch.
Summarizing:
- Resilient parsing means recovering as much syntactic structure from erroneous code as possible.
- Resilient parsing is important for IDEs and language servers, whose job mostly ends when the code no longer has errors.
- Resilient parsing is related to, but distinct from, error recovery and repair. Rather than guessing what the user meant to write, the parser tries to make sense of what is actually written.
- Academic literature tends to focus on error repair, and mostly ignores pure resilience.
- The biggest challenge of resilient parsing is the design of a syntax tree data structure. It should provide convenient and type-safe access to well-formed syntax trees, while allowing arbitrary malformed trees.
- One possible design here is to make the underlying tree a dynamically-typed data structure (like JSON), and to layer typed accessors on top (not covered in this article).
- LL-style parsers are a good fit for resilient parsing. Because code is written left-to-right, it's important that the parser recognizes well-formed prefixes of incomplete syntactic constructs, and LL does just that.
- Ultimately, parsing works as a stack of nested `for` loops. Inside a single `for` loop, on each iteration, we need to decide between:
  - trying to parse a sequence element,
  - skipping over an unexpected token,
  - breaking out of the nested loop and delegating recovery to the parent loop.
- First, follow, and recovery sets help with making a specific decision.
- In any case, if a loop tries to parse an item, the item parsing must consume at least one token (if only to report an error).
Source code for the article is here: https://github.com/matklad/resilient-ll-parsing/blob/master/src/lib.rs#L44 .