Modern Parser Generator Jun 6, 2018

Hi! During the last couple of years, I’ve spent a lot of time writing parsers and parser generators, and I want to write down my thoughts about this topic. Specifically, I want to describe some properties of a parser generator that I would enjoy using. Note that this is not an “introduction to parsing” blog post, some prior knowledge is assumed.

Why do I care about this at all? The broad reason is that today a lot of tools and even most editors use regular expressions to approximately parse programming languages, and I find this outright b҉a͡rb̢ari͞c͘. I understand that in practice parsing is not as easy as it is in theory:

Law: You can’t check code you can’t parse. Checking code deeply requires understanding the code’s semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.

A few billion lines of code later

However, I do believe we could do better if we use better tools!

The specific reason is that I care way too much about the Rust programming language and

I think today it is the best language for writing compiler-like stuff (yes, better than OCaml!),
I’d love to see an awesome parser generator written in and targeting Rust,
I want to write a Rust parser in a slightly better way. I’ve done it twice already :) (update: thrice)

I’ve used various parser generators, implemented one, fall, and still haven’t met a parser generator that I love.

The post is split into three major chapters:

UX — how to make using a parser generator easy, enjoyable and fun?
API — what API the generated parser should have.
Parsing Techniques — how exactly do we get from text to the parsed tree?

I’ll be using a rather direct and assertive language in the following, but the fact is I am totally not sure about anything written here, and would love to know more about alternatives!

UX

Although this text is written in Emacs, I strongly believe that a semantic-based, reliable, and fast support from tooling is a great boon to learnability and productivity. A great IDE support is a must for a modern parser generator, and this chapter talks mostly about IDE-related features.

The most important productivity boost of a parser generator is the ability to fiddle with grammar interactively. The UI for this might look as a three-pane view, where the grammar is on the first pane, example code to parse is in the second pane and the resulting parse tree is in the third one. Editing first two panes should reactively update the last one. This is difficult to implement with most yacc-like parser generators, I’ll talk more about it in the next section.

The second most important feature is inline tests: for complex grammars it could be really hard to map from a particular rule specification to actual code that is parsed by the rule. Having a test written alongside the rule is invaluable! The test should be just a snippet of code in the target language. The “gold” value of the parse tree for the snippet should be saved in the file alongside the grammar and should be updated automatically when the grammar changes. Having inline tests allows to fit the “three pane UI” from the previous into two panes because you can just use the test as your second pane.

Here’s a video that shows how it works in fall: https://youtu.be/gb1MJnTcvds.

Note that even if you write your parser by hand, you still should use such “inline tests”. To do so, write them as comments with special markers, and write a small script which extracts such comments and turns them into tests proper. Here’s an example from one experimental hand-written parser of mine. Having such examples of “what does this if parses?” greatly simplifies reading of parser’s code!

Here’s the list of important misc IDE features, from super important to very important. They are not specific to parser generators, so, if you are using a parser generator to implement IDE support for your language, look into these first!

Extend selection to the enclosing syntactic structure (and not just to a braced block). A super simple feature, but this combined with multiple cursors is arguably more powerful than vim’s text objects, and most definitely easier to use.
Fuzzy search of symbols in the current file/in the project: super handy for navigation, both more important and easier to implement than goto definition.
Precise syntax highlighting. Highlighting is not a super-important feature and actually works ok even with regex approximations, but if you already have the syntax tree, then why not use it?
Go to definition/find references.
Errors and warnings inline, with fixes if available.
Extract rule refactoring, pairs well with extend selection.
Code formatting.
Smart typing: indenting code on Enter, adding/removing trailing commas when joining/splitting lines, and in general auto magically fixing punctuation.
Code completion: although for parser generators dumb word-based completion tends to work OK.

Here’s a short demo of some of these features in fall: https://youtu.be/WRWmwfBLf7o.

I want to emphasize that most of these features are ridiculously easy to implement, if you have a parse tree for your language. Take, for example, “fuzzy search of symbols in the project”. This is a super awesome feature for navigation. Basically, it is CTAGS done right: first, you parse each file (in parallel) and build a list of symbols for it. Then, as user types, you incrementally update the changed files. Using fall, I’ve implemented this feature for Rust, and it took me three small files:

find_symbols.rs to extract symbols from a single file, 21(!) lines.
indxr.rs, a generic infra to watch files for changes and recompute the index incrementally, 155 lines.
symbol_index.rs glues the previous two together, and adds fst by ever-awesome BurntSushi on top for fuzzy search, 122 lines.

This is actually practical: initial indexing of rust-lang/rust repo takes about 30 seconds using a single core and fall’s ridiculously slow parser, and after that everything just works:

https://youtu.be/KyUUDcnOvUw

A small note on how to pack all this IDE functionality: make a library. That way, anyone could use it anywhere. For example, as a web-assembly module in the online version. On top of the library you could implement whatever protocol you like, Microsoft’s LSP, or some custom one. If you go the protocol-first way, using your code outside of certain editors could be harder.

API

Parse Tree

Traditionally, parser generators work by allowing the user to specify custom code for each rule, which is then copy-pasted into the generated parser. This is typically used to construct an abstract syntax tree, but could be used, for example, to evaluate arithmetic expressions during parsing.

I don’t think this is the right API for the parser generator for three reasons though.

It feels like a layering violation because it allows to intermix parsing with basically everything else. You can literally do code-generation during parsing. It makes things like the lexer hack possible.

It would be very hard to implement reactive rendering of the parse tree if the result of parsing is some user-defined type.

Most importantly, I don’t think that producing abstract syntax tree as a result of parsing is the right choice. The problem with AST is that it, by definition, loses information. The most commonly lost things are whitespace and comments. While they are not important for a command-line batch compiler, they are crucial for IDEs, which work very close to the original source code. Another important IDE-specific aspect is support for incomplete code. If a function is missing a body and a closing parenthesis on the parameter list, it’s still better be recognized as a function. It’s difficult to support such missing pieces in traditional AST.

I am pretty confident that a better API for the generated parser is to produce a parse tree which losslessly represents both the input text and associated tree structure. Losslessness is a very important property: it guarantees that we could implement anything in principle.

I’ve outlined one possible design of such lossless representation in the libsyntax2 RFC, the simplified version looks like this:

struct Kind(u32);

struct Node {
    kind: Kind,
    span: (usize, usize),
    children: Vec<Node>,
}

That is, the result of parsing is a homogeneous tree, with nodes having two bits of information besides the children:

Type of a node: is it a function definition, a parameter, a comment?
Region of the source text covered by the node.

A cool thing about such representation is that every language uses the same type of the syntax tree. In fall features like extend selection are implemented once and work for all languages.

If you need it, you can do the conversion to AST in a separate pass. Alternatively, it’s possible to layer AST on top of the homogeneous tree, using newtype wrappers like

// invariant: Node.kind == STRUCT_DEF
struct StructDef(Node);

// invariant: Node.kind == STRUCT_FIELD
struct StructField(Node);

impl StructDef {
    fn fields(&self) -> Vec<StructField> {
        self.0.children.iter().filer(|c| c.kind == STRUCT_FIELD)
            .map(StructField)
            .collect()
    }
}

Parser generator should automatically generate such AST wrappers. However, it shouldn’t directly infer them from the grammar: not every node kind needs an AST wrapper, and method names are important. Better to let the user specify AST structure separately, and check that AST and parse tree agree. As an example from fall, here is the grammar rule for Rust paths, the corresponding ast definition, and the generated code.

Incremental Reparsing

Another important feature for modern parser generator is support for incremental reparsing, which is obviously useful for IDEs.

One thing that greatly helps here is the split between parser and lexer phases.

It is much simpler (and more efficient) to make lexing incremental. When lexing, almost any change affects at most a couple of tokens, so in theory incremental lexing could be pretty efficient. Beware though that worst-case relexing still has to be linear, because insertion of unclosed quote changes all the following tokens.

In contrast, it is much easier to change tree structure significantly with a small edit, which places upper-bound on incremental reparsing effectiveness. Besides, making parsing incremental is more complicated because you have to deal with trees instead of a linear structure.

An interesting middle ground here is an incremental lexer combined with a fast non-incremental parser.

Lexer

Traditional lex-style lexers struggle with special cases like ml-style properly nested comments or Rust raw literals which are even not context-free. The problem is typically solved by injecting custom code into lexer, which maintains some sort of state, like a nesting level of comments. In my experience, making this work properly is very frustrating.

These two tricks may make writing lexer simpler.

Instead of supporting lexer states and injecting custom code, allow to pair regex, which defines a token, with a function which takes a string slice and outputs usize. If lexer matches such external token, it then calls supplied function to determine the other end of the token. Here’s an example from fall: external token, custom functions.

Often it is better to use layered languages instead of lexer states. Parsing string literals is a great example of this. String literals usually have some notion of a well-formed escape sequence. The traditional approach to parsing string literals is to switch to a separate lexer state after ", which handles escapes. This is bad for error recovery: if there’s a typo in an escape sequence, it should still be possible to recognize literal correctly. So alternative approach is to parse a string literal as, basically, “anything between two quotes”, and then use a separate lexer for escapes specifically later in the compiler pipeline.

Another interesting lexing problem which arises in practice is context-sensitivity: things like contextual keywords or >> can represent different token types, depending on the surrounding code. To deal with this case nicely, the parser should support token remapping. While most of the tokens appear in the final parse tree as is, the parser should be able to, for example, substitute two > > tokens with a single >>, so that later stages of compilation need not to handle this special case.

Parser

A nice trick to make parser more general and fast is not to construct parse tree directly, but emit a stream of events like “start internal node”, “eat token”, “finish internal node”. That way, parsing does not itself allocate and, for example, you can use the stream of events to patch an existing tree, doing minimal allocations. This also divorces the parser from a particular tree structure, so it is easier to plug-in different tree backends.

Events also help with reshuffling the tree structure. For example, during event processing we can turn left-leaning trees to right-leaning ones or flatten them into lists. Another interesting form of tree reshuffling is attachment of comments. If a comment immediately precedes some definition, it should be a part of this definition. This is not specified by the language, but it is the result that human would expect. With events, we can handle only significant tokens to the parser and deal with attaching comments and whitespace when reconstructing tree from a flat list of events.

Miscellaneous concerns

To properly implement incremental reparsing, we should start with a data structure for text which is more efficient to update than String. While we do have quite a few extremely high-quality implementations of ropes, the ecosystem is critically missing a way to talks about them generically. That is, there’s no something like Java’s CharSequence in Rust (which needs a much more involved design in Rust to avoid unnecessary overhead).

Luckily, the parse tree needs to remember only the offsets, so we can avoid hard-coding a particular text representation, and we don’t even need a generic parameter for that.

Homogeneous trees make reactive testing of the grammar possible in theory because you can always produce a text representation of a tree from them. But in practice reactivity requires that “read grammar, compile parser, run it on input” loop is fast. Literally generating source code of the parser and then compiling it would be too slow, so some kind of interpreted mode is required. However, this conflicts with the need to be able to extend lexer with custom code. I don’t know of a great solution here, but something like this would work:

require that all lexer extensions are specified in the verbatim block of the grammar file and don’t have external dependencies,
for IDE support, compile the lexer, and only the lexer, in a temp dir and communicate with it via IPC.

A possible alternative is to use a different, approximate lexer for interactive testing of the grammar. In my experience this makes such testing almost useless because you get different results in interesting cases and interesting cases are what is important for this feature.

In IDEs, a surprisingly complicated problem is managing a list of open and modified files, synchronizing them with the file system, providing consistent file-system snapshots and making sure that things like in-memory buffers are also possible. For parser generators, all this complexity might be dodged by requiring that all of the grammar needs to be specified in a single file.

Parsing Techniques

So we want to write a parser generator that produces lossless parse trees and which has an awesome IDE support. How do we actually parse a text into a tree? Unfortunately, while there are many ways to parse text, there’s no accepted best one. I’ll try to do a broad survey of various options.

I’d love to discuss the challenges of the textbook approach of just using a context-free grammar/BNF notation. However, let’s start with a simpler, “solved” case: regular expressions.

Languages which could be described by regular expressions are called regular. They are exactly the same languages which could be recognized by finite state machines. These two definition mechanisms have nice properties which explain the usefulness of regular languages in real life:

Regular expressions map closely to our thinking and are easy for humans to understand. Note that there are equivalent in power, but much less “natural” meta-languages for describing regular languages: raw finite state machines or regular grammars.
Finite state machines are easy for computers to execute. FSM is just a program which is guaranteed to use constant amount of memory.

Regular languages are rather inexpressive, but they work great for lexers. On the opposite side of expressivity spectrum are Turing machines. For them, we also have a number of meta-languages (like Rust), which work great for humans. It’s interesting that a Turing machine is equivalent to a finite state machine with a pair of stacks: to get two stacks from a tape, cut the tape in half where the head is. Moving the head then corresponds to popping from one stack and pushing to another.

And the context-free languages, which are described by CFGs, are exactly in between languages recognized by finite state machines and languages recognized by Turing machines. You need a push-down automaton, or a state machine with one stack, to recognize a context-free language.

CFGs are powerful enough to describe arbitrary nesting structures and seem to be a good fit for describing programming languages. However, there are a couple of problems with CFGs. Let’s write a grammar for arithmetic expressions with additions, multiplications, parenthesis and numbers. The obvious answer,

E -> E + E | E * E | (E) | number

has a problem. It is under specified and does not tell if 1 + 2 * 3 is (1 + 2) * 3 or 1 + (2 * 3). We need to tweak the grammar to get rid of this ambiguity:

E -> F | E + F
F -> T | F * T
T -> number | (E)

I think the necessity of such transformations is a problem! Humans don’t think like this: it took me three or four courses in formal grammars to really internalize this transformation. And if we look at language references, we’ll typically see a precedence table instead of BNF.

Another problem here is that we even can’t workaround ambiguity by plainly forbidding it: checking if CFG is unambiguous is undecidable.

So CFGs turn out to be much less practical and simple than regular expressions. What options do we have then?

Abandoning CFG

The first choice is to parse something, not necessary a context-free language. A good way to do it is to write a parser by hand. A hand-written parser is usually called a recursive descent parser, but in reality it includes two crucial techniques in addition to just recursive descent. The pure recursive descent works by translating grammar rules like T -> A B into a set of recursive functions:

fn parse_t() {
    parse_a();
    parse_b();
}

The theoretical problem here is that it can’t deal with left-recursion. That is, rules like Statements -> Statements ';' OneStatement make recursive descent parsers loop infinitely. In theory, this problem is solved by rewriting the grammar and eliminating the left recursion. If you had a formal grammars class, you probably have done this! In practice, this is a completely non-existent problem, because we have loops:

fn parse_statements() {
    loop {
        parse_one_statement();
        if !parse_semicolon() {
            break;
        }
    }
}

The next problem with recursive descent is that parsing expressions with precedence requires that weird grammar rewriting. Luckily, there’s a simpler technique to deal with expressions. Suppose you want to parse 1 + 2 * 3. One way to do that would be to parse it with a loop as a list of atoms separated by operators and then reconstruct a tree separately. If you fuse these two stages together, you get a loop, which could recursively call itself and nest, a Pratt parser. Understanding it for the first time is hard, but you only need to do it once :)

The most important feature of hand-written parsers is a great support for error recovery and partial parses. It boils down to two simple tricks.

If you are parsing a homogeneous sequence of things (i.e, you are inside the loop), and the current token does not look like it can begin a new element, you just skip over it and start the next iteration of the loop. Here’s an example from Kotlin. At this line, we’ll get null if current token could not begin a class member declaration. Here we just skip over it.

If you are parsing a particular thing T, and you expect token foo, but see bar, then, roughly:

if bar is not in the FOLLOW(T), you skip over it and emit error,
if bar is in FOLLOW(T), you emit error, but don’t skip the token.

That way, parsing something like

fn foo(

struct S {
   f: u32
}

would correctly recognize incomplete function foo (again, its easier to represent such incomplete function with homogeneous parse trees than with AST), and a complete struct S. Here’s another example from Kotlin.

Although hand-written parsers are good at producing high-quality error messages as well, I don’t think that this is important. In the IDE context, for syntax errors it is much more important and beneficial to get a red squiggly under the error immediately after you’ve typed invalid code. Instantaneous feedback and precise location are, in my personal experience, enough to fix syntax errors. The error message can be just “Syntax error”, and more elaborate messages are often make things worse because mapping from an error message to what is actually wrong is harder than just typing and deleting stuff and checking if it works.

It is possible to simplify authoring of this style of parsers by generating all recursive functions, loop and Pratt parsers from declarative BNF/PEG style description. This is what Grammar Kit and fall do.

Embracing ambiguity

Another choice is to stay within CFG class but avoid dealing with ambiguity by producing all possible parse trees for a given input. This is typically achieved using non-determinism and memorization, using GLR and GLL style techniques.

Here I’d like to call out tree-sitter project, which actually ticks quite a few boxes outlined in this blog post. In particular, it uses homogeneous trees, is fully incremental and has surprisingly good support for error recovery (though not quite as good as hand-written style parsers, at least when I’ve last checked it).

Abandoning generality

Yet another choice is to give up full generality and restrict the parser generator to a subset of unambiguous grammars, for which we actually could verify the absence of ambiguity. This is how traditional parser generators like yacc, happy, menhir or LALRPOP work.

The very important advantage of these parsers is that you get a strong guarantee that the grammar works and does not have nasty surprises. The price you have to pay, though, is that sometimes it is necessary to tweak an already unambiguous grammar to make the stupid tool understand that there’s no ambiguity.

I also haven’t seen deterministic LR parsers with great support for error recovery, but looks like it should be possible in theory? Recursive descent parsers, which are more or less LL(1), recover from errors splendidly, and LR(1) has strictly more information than an LL(1) one.

So, what is the best choice for writing a parser/parser generator?

It seems to me that the two extremes are the most promising: hand written parser gives you utmost control over everything, which is important when you need to parse some language, not designed by you, which is hostile to the usual parsing techniques. On the other hand, classical LR-style parsers give you a proof that the grammar is unambiguous, which is very useful if you are creating your own language. Ultimately, I think that being able to produce lossless parse trees supporting partial parses is more important than any particular parsing technique, so perhaps supporting both approaches with a single API is the right choice?

Conclusion

This turned out to be a quite lengthy post, hope it was interesting! These are the main points:

IDE support is important, for the parser generator itself as well as for the target language.
Lossless parse trees are more general than ASTs and custom action code, and are a better fit for IDEs.
Interactivity matters! Reactive grammar repl and inline tests rock!
Parsing is an unsolved problem :)

Discussion on /r/rust.