Extending Pratt parser with layout sensitivity

This post shortly describes rules for extending a top-down precedence parser with vertical stacking, rendering the input language layout sensitive.

Algorithm

When an expression with precedence '0' starts a line, it leaves a mark on it's column. The mark is active when the expression is complete and the next token is below the column line. The mark overrides application of the left-denotation rule in the expression parsing function to produce vertical stacking expressions (a ¶ b). Vertical stacking is attempted if the next token is on the same column, otherwise the expression under the mark is completed.

Examples

I provide you some examples to understand what I mean. Here's a simple mathematical formula that follows the precedence rules that you would expect from it.

a + b * c = d

The expression starting on this line holds precedence of '0', so we annotate it with a mark.

a + b * c = d       {¶ 0}
─────────────────────────
(a + (b * c)) = d

Below the horizontal line remains the result of our parsing.

Now lets split the expression to multiple lines.

a + b *             {¶ 0}
c = d
─────────────────────────
(a + (b * c)) = d

This clause is equivalent to the previous one, because the expression on the line is not complete and the mark does not activate.

If we split it in a different way, we will end up with this:

a                   {¶ 0}
+ b * c = d         {¶ 0}
─────────────────────────
a ¶ ((+ (b * c)) = d)

The vertical stacking expressions assume right-associative precedence. Preferably the operator bound to vertical stacking is associative so that we don't need to care about that detail, but lets decide on this early on.

The another rule that prevents a mark from activating is the column positioning of the next line.

a                   {¶ 0}
  + b * c = d
─────────────────────────
(a + (b * c)) = d

The '+' token starting the next line was above the column line, therefore the mark did not apply. The second line continues the previous expression, so it does not form a new mark. However it would not leave a mark anyway, because the precedence of the starting expression has the rbp(+) precedence that is not precedence '0'.

a +                 {¶ 0}
  b * c = d
─────────────────────────
(a + (b * c)) = d

To activate vertical stacking, you have to construct a starting expression that has the zero precedence.

a + (               {¶ 0}
  b                 {¶ 2}
  c                 {¶ 2}
  d )               {¶ 2 ⊥ }
────────────────────────────
a + (b ¶ (c ¶ d))

The deeper stacking rule terminates because the expression is completed without vertical stacking. Of course it could be also completed through the stacking mechanism itself.

a + (               {¶ 0}
  b                 {¶ 2}
  c                 {¶ 2}
  d                 {¶ 2}
)
─────────────────────────
a + (b ¶ (c ¶ d))

The right parenthesis is needed to complete the expression on the first rule, so the mark does not interfere with it's placement.

There are no further examples that'd show some functionality that you've not already seen. Pratt parsing produces a lean and simple tool to parse your language if you treat the whole input language to consist of formulas that you can stack.

Here are few valid ways to represent conditional in such formulaic ways, with their derivations represented aside them.

if x {
    a
    b
    c
}
────────────────────
if(x, (a ¶ (b ¶ c)))

if x then
    a
    b
    c
else y
───────────────────────
if(x, (a ¶ (b ¶ c)), y)

Finally if you wonder what happens with dangling else case, the pratt parser already takes care of that. The formula after the 'then' -token would be zero-precedenced.

if x then if y then z else w
────────────────────────────
if(x, if(y, z, w))

Treatment of every expression as a formula is usually reserved for toy languages. Perhaps the applicability of such languages is something to be reconsidered.

Behavior under bad layout

But is this something you could take seriously? If we are given a badly layouted input formula we would prefer to produce an error on that. The layouting algorithm does not specify any failure mode though. Those come from the precedence parser and are same as in the original parsing algorithm.

Surprisingly this is sufficient, because the grammar can deny the possibility for the vertical stacking if the stacks do not align. Here's what it means in practice:

a + (               {¶ 0}
  b                 {¶ 2}
 c
  d )
────────────────────────────────
parsing error at the third line,
parenthesis doesn't close.

If we have an ALGOL-style grammar that requires a an operator such as comma between two separate expressions, then the failure to vertically stack the formulas will inevitably result in the failure of the parsing in whole.

Implementation

Here's a piece from the Javascript parser. You can find the whole source behind the first link in this post. This is the function that needs to be modified to implement vertical stacking:

var expression = function (rbp) {
    var left;
    var t = token;
    advance();
    left = t.nud();
    while (rbp < token.lbp) {
        t = token;
        advance();
        left = t.led(left);
    }
    return left;
}

The function needs to implement the constraints that we have set upon the application of the vertical stacking rule. I think this would do the trick.

var expression = function (rbp, vcol) {
    var left;
    var t = token;
    var stacking = (rbp == 0 && at_line_begin());
    if (stacking) {
        vcol = token.col;
    }
    advance();
    left = t.nud();
    while (rbp < token.lbp && !(at_line_begin() && token.col <= vcol)) {
        t = token;
        advance();
        left = t.led(left);
    }
    if (stacking && at_line_begin() && token.col == vcol) {
        left = vstack(left, expression(rbp, vcol));
    }
    return left;
}

The 'vcol' variable is the current vertical column. We would parse the whole input file with the expression(0, 0). Additionally you would have to introduce the .col, at_line_begin() function and the vstack(l, r) constructor yourself to complete the design.

It seems simpler than I thought, which means I probably did something wrong somewhere.

Refinement rule for right-precedence

When I was trying out the syntax, I found out a situation where we need an additional rule. Consider the following formula:

a:                  {¶ 0}
b                   {¶ 0}
c
─────────────────────────
a:(b¶c) ∨ (a:b)¶c

In this kind of a situation we have two marks that are active at the same time, but we may have it worse:

a:                  {¶ 0}
  b:                {¶ 2}
c                   {¶ 0}
  d
─────────────────────────
a:((b:c)¶d)

Therefore we need a refinement to the first rule: When an expression with precedence '0' starts a line, it leaves a mark on it's column, but only if a mark cannot be left into a higher column, such that it is active at the same time as this mark.

Conclusion

I haven't fully tested the program yet, so it may be too early to draw conclusions.

Motivation

Before this I've figured out some nice ways to parse code, but a complex parser makes a language that is hard to port outside of its own little environment. There's value in simple ways to parse expressions if they could be made sufficient.

I think that the layout for the code conveys valuable information. If the computer is not reading the layout then it has to obtain that information through some other means. The result is language that is cumbersome to read and modify because of the separators that could have been left out if the layout had mattered.

How did I came up with these rules? Through a little bit of experimentation and prior experience.

The ideas to orient layout parsing around the column and beginning of the line are due to the graphical observations of reading layout. I had used them in a general purpose parser before this.

Introduction of a grammatic rule with adjacent expressions always cause ambiguity between prefix and binary operators that use the same symbol. Positioning of elements vertically can behave like a separator.

When exploring the results of these existing rules I had found out, I concluded that the vertical stacking would require invariance in precedence of things that are vertically stacked over each other. Otherwise the first expression in the vertical stack would have different precedence than the others. The sensible precedence for all elements in a stack would be '0'. When discovering this I first tried to zero the precedence when a layout mark is set.

a *
  b + 1
  b + 2
  b + 3
─────────────────────────────
a * ((b+1) ¶ ((b+1) ¶ (b+1)))

This would give mathematical expressions somewhat surprising behavior when they are layouted. The behavior would eventually make sense to someone, but I've seen it before. I will eventually trip to the odd rule myself and end up thinking of alternatives.

Finally I realised that I don't need to change the precedence for an expression if I instead constraint the layout marks by the precedence. When that last remaining problem was solved, I realised it looks rather promising.