Regular, Recursive, Restricted

A post/question about formal grammars, wherein I search for a good formalism for describing infix expressions.

Problem statement: its hard to describe arithmetic expressions in a way that:

Lets start with the following grammar for arithmetic expressions:

Expr =
    'number'
  | '(' Expr ')'
  | Expr '+' Expr
  | Expr '*' Expr

It is definitely declarative and obvious. But it is ambiguous it doesnt tell whether * or + binds tighter, and their associativity. You can express those properties directly in the grammar:

Expr =
    Factor
  | Expr '+' Factor

Factor =
    Atom
  | Factor '*' Atom

Atom = 'number' | '(' Expr ')'

But at this point we lose decorativeness. The way my brain parses the above grammar is by pattern matching it as a grammar for infix expressions and folding it back to the initial compressed form, not by reading the grammar rules as written.

To go in another direction, you can define ambiguity away and get parsing expression grammars:

Exp =
    Sum
  / Product
  / Atom

Sum     = Expr (('+' / '-') Expr)+
Product = Expr (('*' / '/') Expr)+

Atom = 'number' | '(' Expr ')'

This captures precedence mostly declaratively: we first match Sum, and, failing that, match Product. But the clarity of semantics is lost PEGs are never ambiguous by virtue of always picking the first alternative, so its too easy to introduce an unintended ambiguity.

Can we have both? Clarity with respect to tree shape and clarity with respect to ambiguity?

Let me present a formalism that, I think, ticks both boxes for the toy example and pose a question of whether it generalizes.


Running example:

E =
    'number'
  | '(' E ')'
  | E '+' E

As a grammar for strings, it is ambiguous. There are two parse trees for 1 + 2 + 3 thecorrect one (1 + 2) + 3, and the alternative: 1 + (2 + 3).

Instead, lets see it as a grammar for trees instead. Specifically, trees where:

For trees, this is a perfectly fine grammar! Given a labeled tree, its trivial to check whether it matches the grammar: for each node, you can directly match the regular expression. Theres also no meaningful ambiguity while arbitrary regular expressions can be ambiguous (aa | a*), this doesnt really come up as harmful in practice all that often, and, in any case, its easy to check that any two regular alternatives are disjoint (intersect the two automata, minimize the result, check if it is empty).

As a grammar for trees, it has the following property: there are two distinct trees which nevertheless share the same sequence of leaves:

        E                  E
        o                  o
      / | \              / | \
     E '+' E            E '+' E
     o     |            |     o
   / | \  '3'          '1'  / | \
  E '+' E                  E '+' E
  |     |                  |     |
 '1'   '2'                '2'   '3'

So lets restrict the set of trees, in the most straightforward manner, by adding some inequalities:

E =
    'number'
  | '(' E ')'
  | E '+' E

E !=
    E '+' [E '+' E]

Here, square brackets denote a child. E '+' [E '+' E] is a plus node whose right child is also a plus node. Checking whether a tree conform to this modified set of rules is easy as negative rules are also just regular expressions. Well, I think you need some fiddling here, as, as written, a negative rule matches two different levels of the tree, but you can flatten both the rule and the actual tree to the grandchildren level by enclosing children in parenthesis. Let me show an example:

We want to match this node:

    E
    o
  / | \
 E '+' E
 |     o
'1'  / | \
    E '+' E

against this rule concerning children and grand children:

E '+' [E '+' E]

We write the list of children and grandchidren of the node, while adding extra [], to get this string:

['1'] '+' [E '+' E]

And in the rule we replace top-level non-terminals with [.*], to get this regular expression:

[.*] '+' [E '+' E]

Now we can match the string against a regex, get a mach, and rule out the tree (remember, this is !=).

So here it is, a perfectly functional mathematical animal: recursive restricted regular expression:

This construction denotes a set of labeled trees, where interior nodes are labeled with N, leaves are labeled with T and for each interior node

And the main question one would have, if confronted with a specimen, is is it ambiguous? That is, are there two trees in the set which have the same sequence of leaves?

Lets look at an example:

Expr =
    'number'
  | '(' Expr ')'
  | Expr '+' Expr
  | Expr '*' Expr

Expr !=
             Expr '+' [Expr '+' Expr]
|            Expr '*' [Expr '*' Expr]
|            Expr '*' [Expr '+' Expr]
| [Expr '+' Expr] '*' Expr

It looks unambiguous to me! And I am pretty sure that I can prove, by hand, that it is in fact unambiguous (well, I might discover that I miss a couple of restrictions in process, but it feels like it should work in principle). The question is, can a computer take an arbitrary recursive restricted regular expression and tell me that its unambiguous, or, failing that, provide a counter-example?

In the general case, the answer is no this is at least as expressive as CFG, and ambiguity of arbitrary CFG is undecidable. But perhaps theres some reasonable set of restrictions under which it is in fact possible to prove the absence of ambiguity?