Layout parsing with context-free grammars

Careless use of feedback in parsing can destroy an important and useful property of context-free grammars: that the grammars are generative.

I just upgraded the parser and grammar in the Lever programming language with layout constraints to fix this problem.

How the preprocessing step was done

Before the upgrade, Lever's indentation sensitivity was provided by a preprocessing step after the tokenizer. For each token it performed the following steps before feeding the token into the parser.

if at_start_of_line:
    # Close as many indentation blocks as the column and the parser allow.
    while col < level and expecting(dedent):
        step(dedent)
        level = indent_stack.pop()
    if col < level:
        raise SyntaxError("Uneven indent")
    if col == level and expecting(newline):
        step(newline)
    if col > level and expecting(indent):
        step(indent)
        indent_stack.append(level)
        level = col

# Close indentation blocks early when the next token is a closing
# token that cannot appear inside the current block.
if not expecting(next_term) and can_close(next_term):
    while expecting(dedent):
        step(dedent)
        level = indent_stack.pop()

step(next_term)

This program inserts indent, dedent and newline tokens whenever the parser expects to see them.

The second clause closes indentation blocks prematurely whenever the next token is a closing parenthesis that doesn't fit inside the block.

The expecting(token) call determines whether the parser can accept the upcoming token; this is the feedback from the parser.
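
Exactly how expecting() is computed depends on the parsing engine. As a rough sketch of the idea (with made-up names, not Lever's actual code), an Earley-style parser can answer it by checking its current item set:

from collections import namedtuple

Rule = namedtuple("Rule", "lhs rhs")   # rhs is a tuple of symbol names
Item = namedtuple("Item", "rule dot")  # dot marks how far the rule has matched

def expecting(current_items, kind):
    # The parser expects a terminal if some item's dot sits in front of it.
    return any(item.dot < len(item.rule.rhs) and item.rule.rhs[item.dot] == kind
               for item in current_items)

# With the item  stmt -> expr . newline  in the current set,
# a newline token is expected but an indent token is not.
items = [Item(Rule("stmt", ("expr", "newline")), 1)]
assert expecting(items, "newline")
assert not expecting(items, "indent")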

This way of implementing indentation sensitivity is problematic. It prioritizes grammar interpretations that contain indent/dedent/newline tokens over those that do not have them.

The behavior of the whole language becomes very difficult to explain because the generativity of the grammar is lost. We'll discuss generativity further below.

Generative grammars

Context-free grammars are generative. This is a pleasant property that can be hard to see behind shift/reduce and reduce/reduce conflicts.

The term "generativity" means that when you see a production rule in a grammar, then the items matching that production exist in the language formed by the grammar.

For example, in CSS you have declarations that could be described with the following production:

declaration: property ":" value

This tells you that a valid declaration might look something like width: 50ex.

You can follow the grammar higher up to see where the declaration is allowed to occur in the language:

declarations_block:
    "{" declarations ";"? "}"

declarations:
    declaration
    declarations ";" declaration

This tells you that the declarations may occur in blocks like this:

{ width: 50ex; font-size: 120% }

The grammar also reveals that the last semicolon before the right brace is optional.

If you follow it up still higher, you will find out that:

rule: selector_group declarations_block

If you've studied what a selector group is, then you are able to create full CSS rules:

#main_section { width: 50ex; font-size: 120% }

And finally you see that:

file: statement*

statement:
    rule
    at_rules

With this you know that this is a syntactically correct CSS file:

#foo .bar {
    flubber: 1;
}

.fiba { fuba: fub; gab-fip: none }

It is extremely easy to explain what a CSS file is and what it looks like when there is a generative grammar it obeys.

Therefore, if a grammar is generative, it also works as a form of documentation when supplied with enough useful information. But it is also a usable implementation of a parser for the language when fed into a suitable parsing engine.

When generativity is broken, it means you can no longer trust that a certain production always means something specific when it appears in the correct place.

file: some_statement*

some_statement:
    group_statement
    statement

Next you see text like this:

statement
  statement

You would think this is two separate statements in a file, based on what you've read so far. Next you see this rule:

group_statement: statement indent statement dedent

With the indent and dedent symbols this production rule means that a statement indented after one should group to the previous statement.

An astute reader notes that this can't be. The language described above is ambiguous if the indent and dedent are just tokens that could be removed and inserted at certain places.

The problem is that these layout tokens are inserted when they are expected. Therefore the indented statement ends up being parsed as a group of statements, even though it would otherwise be perfectly valid as two separate statements.

This is a problem because in a sufficiently complex language, these entirely arbitrary rules could take precedence and force otherwise ambiguous expressions to mean something other than what the grammar says. That is exactly what happened in the statement example.

Previous work

When I was looking for a replacement for my layout parsing, I learnt about these two papers:

  1. "Principled Parsing for Indentation-Sensitive Languages" by Michael D. Adams
  2. "Layout-sensitive Generalized Parsing" by Erdweg, Rendel, Kästner, Ostermann.

I ended up implementing something similar to both of these papers, but I found the extended grammars by Adams extremely confusing.

For example, look at this rule from the paper:

case: "case"{>} exp{=} "of"{>} altblock{=}

According to the paper, this means that the 'case' and 'of' tokens are above the indentation level of exp, altblock and case, which share the same indentation level.

My approach is not as expressive as this, but it seems to have worked really well so far.

Layout constraints

I decided to implement a variation where every constraint is relative to the first symbol in the production rule where the constraint appears.

There are four constraints with roughly the following meanings:

  1. ^ the next symbol must start a new line
  2. ! the next symbol must not start a new line
  3. > the next symbol must be to the right of the indentation depth of the first symbol in the production rule.
  4. = the next symbol must be vertically aligned with the first symbol in the production rule.

The indentation depth is always measured from the beginning of the line where the first symbol of the production rule appears.
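
To make those meanings concrete, here is a minimal sketch of how such a check could be written. This is not Lever's actual implementation; it just assumes each token carries its column and whether it begins a line, and that the indentation depth of the anchor's line is known.

from collections import namedtuple

# Hypothetical token record: kind, column, and whether it is the
# first token on its line.
Token = namedtuple("Token", "kind column first_on_line")

def satisfies(constraint, token, depth):
    # depth = indentation depth of the line holding the production's
    # first symbol (its anchor).
    if constraint == '^':          # must start a new line
        return token.first_on_line
    if constraint == '!':          # must not start a new line
        return not token.first_on_line
    if constraint == '>':          # must sit right of the anchor's depth
        return token.column > depth
    if constraint == '=':          # must align with the anchor's depth
        return token.column == depth
    return True                    # unconstrained symbol

# With group_statement: statement >^statement and the first statement's
# line at depth 0, a second statement at column 2 on its own line
# satisfies both '>' and '^'.
inner = Token("statement", 2, True)
assert satisfies('>', inner, 0) and satisfies('^', inner, 0)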

The language described earlier would be written like this using these constraints:

file: =^some_statement*

some_statement:
    group_statement
    statement

group_statement: statement >^statement

Next if you see text like this:

statement
  statement

You know that it's not two separate statements, because two statements would have to align vertically. On the other hand, the layout satisfies that of the group statement. The generativity of the grammar is restored: if a clause appears in the grammar, then it exists in the language.

Upgrade

Implementing this kind of constraint ended up being easy. It also exposed a bug in one of the optimization steps of my parsing library.

The extensions to the parser would be worth their own article, but in short: to get an Earley-style parser to handle this, you prevent a shift on an Earley item when the incoming token does not satisfy the layout constraint. Additionally, a little bit of depth-first search needs to be done for constraints that appear on nonterminals.
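
As a rough illustration of the shift part (again a sketch with made-up names rather than the real parser, reusing the satisfies() check sketched earlier), the scan step would only advance an item over a terminal when the token also passes the layout check:

from collections import namedtuple

Rule = namedtuple("Rule", "lhs rhs constraints")  # constraints[i] guards rhs[i]
Item = namedtuple("Item", "rule dot depth")       # depth: indent of the line holding the rule's first token

def scan(item_set, token):
    advanced = []
    for item in item_set:
        if item.dot >= len(item.rule.rhs):
            continue                               # completed item, nothing to shift
        if item.rule.rhs[item.dot] != token.kind:
            continue                               # the dot is not in front of this terminal
        if not satisfies(item.rule.constraints[item.dot], token, item.depth):
            continue                               # layout constraint rejects the shift
        advanced.append(item._replace(dot=item.dot + 1))
    return advanced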

When I wrote the layout constraints I left the old indentation method intact. I will keep it there until the new system does everything the old system did.

I don't intend to make large modifications to the syntax of the language at this time, so I scanned every file written in Lever and made a test that checks that every file still parses correctly after the changes.

Consequences

This modification requires large changes to the grammar of the language. I think it helps the language become much cleaner and easier to learn as I realise what the real grammar of this language is supposed to be.

Function blocks appeared to be in the wrong grammatical group, because ((x): a(x)) is supposed to form a function block rather than a function that is immediately called with the argument x.

One highlight is that within parenthesized blocks, I think I want to pick the style where you can omit a comma if you align the lines vertically. The following grammar describes that syntax for dictionary objects:

term: "{" pairs "}"

pairs:
    []
    pairs1
    pairs1 ','

pairs1:
    pair
    pairs1  ','  >pair
    ^pairs1 ',' ^=pair
    ^pairs1     ^=pair

pair:
    expr ":" expr
    symbol "=" expr

I enjoy how clean that rule ended up being.

At the time of writing this I still have a handful of files that fail to parse under the new grammar. All of them will probably give some ideas on how to improve the grammar.

Ambiguity

If you've only used an LR parser generator, you are likely to believe that ambiguity in a grammar is evil. PEG parsing engines seem to follow the same line of thought by eliminating ambiguity entirely by design.

LR especially conditions people to believe that ambiguity specifically is bad, because it's the thing they have to remove before the LR parser generator accepts the grammar.

Ambiguity is not a source of much drama if you use a parser that can handle ambiguous grammars, though. Strings that are ambiguous simply end up being reported as ambiguous, and I get an opportunity to improve the syntax if the input string shouldn't have multiple different meanings.

Next steps

I think this blog post is a bit incomplete and I'm still ironing out all the wrinkles that were exposed. I would also have hoped to produce some sort of concealed call-to-action here for people who get interested in trying out my programming language.

Unfortunately you have to hold out on your own! Things are bound to go a bit reptilian for new users when there is only one active member in the project. I don't mind though, because Lever has shown me that it's not yet really the language that I ultimately want.

It's still not ready for standardization, and definitely not ready for production use.
