LR(1) parsers

The first time I encountered LR parsing, I didn't understand how it works or why it's valuable for programming language research. Last week I decided to study it, because I needed some perspective for creating my own parsers.

I've noticed the shortest, simplest-to-understand compilers appear when you don't need to lug parse trees around. Instead of building parse trees, you directly prepare the compilation structures, or evaluate the code, while the parser works through the tokens.

What is an LR parser?

An LR parser consists of a stack, a lookup table, and a single step function that takes one token in at a time:

function step(token)
    action = table.action[stack.top.state][token.name]
    while action is reduce(n)
        pop as many values from the stack as rule n has symbols on its right side
        combine them into one value with rule n
        push that value and table.goto[stack.top.state][lhs(n)] onto the stack
        action = table.action[stack.top.state][token.name]
    if action is accept
        return the value from the stack
    else if action is shift(s)
        push the token and state s onto the stack
    otherwise
        report an error

If the parser tables aren't broken, the above program always holds something in its stack. The stack is initialized with the zero state on top. After a successful parse, another input can be parsed.

The parser table is divided into two parts, action and goto. The action table selects one of four actions: reduce(n), shift(n), accept, or error. The goto table determines which state to push after a reduce operation.

Deterministic context-free grammar

You can construct the parser table in various ways, but usually it's created from a deterministic context-free grammar. Deterministic means that the language can be accepted by a deterministic pushdown automaton, such as the parser above. A context-free grammar is a grammar where every production rule is of the form:

V -> w
V is a (nonterminal) symbol
w is a sequence of symbols (terminals and/or nonterminals)

Whenever the sequence w is found, it can be rewritten as the symbol V, independent of the surrounding context. Here's an example grammar:

S -> N + N
N -> 1
N -> 2
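
For example, the input 1 + 2 reduces step by step back to S:

1 + 2
N + 2      (N -> 1)
N + N      (N -> 2)
S          (S -> N + N)

This bottom-up rewriting, smallest pieces first and left to right, is exactly what the LR parser does.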

And here's the corresponding parser table for the LR parser. A plain number in an action cell means shift the token and go to that state; rn means reduce with rule n, counting S -> N + N as rule 0, N -> 1 as rule 1 and N -> 2 as rule 2; acc means accept.

| action               | goto   |
|----------------------|--------|
|   | 1 | +  | 2 | end | S  | N |
|---|---|----|---|-----|----|---|
| 0 | 1 |    | 2 |     | 8  | 3 |
| 1 |   | r1 |   |     |    |   |
| 2 |   | r2 |   |     |    |   |
| 3 |   | 4  |   |     |    |   |
| 4 | 5 |    | 6 |     |    | 7 |
| 5 |   |    |   | r1  |    |   |
| 6 |   |    |   | r2  |    |   |
| 7 |   |    |   | r0  |    |   |
| 8 |   |    |   | acc |    |   |
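
To make the step function concrete, here's a minimal Python sketch with the table above hard-coded. The cell encodings ("s" for shift, "r" for reduce) and the rule representation are my own choices for illustration, and each reduce evaluates directly instead of building a tree, in the spirit mentioned at the start:

# Hard-coded tables for S -> N + N, N -> 1, N -> 2.
RULES = [
    ("S", 3, lambda a, plus, b: a + b),  # rule 0: S -> N + N
    ("N", 1, lambda tok: int(tok)),      # rule 1: N -> 1
    ("N", 1, lambda tok: int(tok)),      # rule 2: N -> 2
]
ACTION = {
    0: {"1": ("s", 1), "2": ("s", 2)},
    1: {"+": ("r", 1)},
    2: {"+": ("r", 2)},
    3: {"+": ("s", 4)},
    4: {"1": ("s", 5), "2": ("s", 6)},
    5: {"end": ("r", 1)},
    6: {"end": ("r", 2)},
    7: {"end": ("r", 0)},
    8: {"end": ("accept",)},
}
GOTO = {0: {"S": 8, "N": 3}, 4: {"N": 7}}

def parse(tokens):
    stack = [(0, None)]                      # (state, value) pairs, state 0 on top
    for name in tokens:
        while True:
            act = ACTION[stack[-1][0]].get(name)
            if act is None:
                raise SyntaxError("unexpected " + name)
            if act[0] == "r":                # reduce: pop, combine, consult goto
                lhs, size, combine = RULES[act[1]]
                values = [v for _, v in stack[-size:]]
                del stack[-size:]
                stack.append((GOTO[stack[-1][0]][lhs], combine(*values)))
            elif act[0] == "s":              # shift: push the token, take next one
                stack.append((act[1], name))
                break
            else:                            # accept: result is on top of the stack
                return stack[-1][1]

print(parse(["1", "+", "2", "end"]))         # prints 3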

Shift/reduce and reduce/reduce conflicts

If you have a deterministic context-free language, you can parse it with an LR parser. Otherwise your parser table generator reports shift/reduce and reduce/reduce conflicts. These may leave you moaning and screeching in terror on the floor if you don't understand them. Fortunately they're not very magical. They mean you've got a knot of rules somewhere that is ambiguous for the LR parser.

Shift/reduce means that a shift action and a reduce action would both claim the same cell in the parser table. It cannot be determined from the lookahead token whether the parser should shift it or reduce with a rule that has already been matched.
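
For instance, this little grammar of my own (not part of the example above) has a shift/reduce conflict:

E -> E + E
E -> 1

After the parser has matched E + E and the lookahead is another +, it could either reduce with E -> E + E or shift the +. Both actions want the same cell.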

Reduce/reduce means that two reduction rules would claim the same cell: the same input sequence matches two different rules.
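
Again a small illustration of my own:

S -> A end
S -> B end
A -> x
B -> x

After shifting x with end as the lookahead, both A -> x and B -> x are completely matched, and both reductions want the same cell.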

How to generate the parser tables

The parser tables are generated by creating item sets from the grammar rules. Each item set becomes a unique row in the table. I'll describe roughly how to build a canonical LR(1) parser table. Canonical tables are the largest kind of parser table, but the parser halts before running any reduces on the token that produced an error.

The parser tables are generated by simulating every state the parser may be in. This is done with item sets. An item is of the following form:

V -> p . w {n}

p is the sequence of symbols already shifted towards completing the reduction rule
w is the sequence of symbols still needed to complete the reduction rule
{n} is the set of terminal symbols that may appear after this rule, the lookahead set

Only the reachable rules are useful, so we introduce accept items: when such an item reduces, it writes accept into the cells named by its lookahead set. We create a kernel item set out of them. This is the first state, and it determines which language our parser accepts:

acc -> . {$}
acc -> . sym1 {$}
acc -> . sym2 {$}

This is the first kernel item set. The next operation is called closure: for every item whose dot sits in front of a nonterminal symbol, we add that nonterminal's rules as fresh items, with lookaheads taken from whatever follows the nonterminal. Taking the closure of the item set we just made:

acc  -> . {$}
acc  -> . sym1 {$}
sym1 -> . sym3 sym4 {$}
sym3 -> . sym5 (first(sym4))
acc  -> . sym2 {$}
sym2 -> . sym7 sym8 sym9 {$}
sym2 -> . sym5 {$}

first(sym4) is the set of terminal symbols that sym4 may begin with; in the example grammar earlier, first(N) = {1, 2}. Next we group these items by the symbol that comes after the dot:

reductions
    acc  -> . {$}

shift sym1
    acc  -> . sym1 {$}

shift sym2
    acc  -> . sym2 {$}

shift sym3
    sym1 -> . sym3 sym4 {$}

shift sym5
    sym3 -> . sym5 (first(sym4))
    sym2 -> . sym5 {$}

shift sym7
    sym2 -> . sym7 sym8 sym9 {$}

These are the kernel item sets you can progress to from the first state. Each of them is appended to the table as a new row, after checking that it isn't there already. On the current row, groups on terminal symbols become shift cells, groups on nonterminal symbols become goto cells, and the reductions become reduce cells under their lookahead symbols.

Since each row is effectively written once and never edited afterwards, these item sets already show whether there are shift/reduce or reduce/reduce conflicts: while filling a row, just check whether a cell already holds something before writing into it.

Because the table is filled by appending the newfound states, you complete it by repeating what you did for the first kernel item set on every subsequent one. In summary: closure -> group -> match -> fill up the table. A sketch of this loop follows below.
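
Here's a rough Python sketch of that loop for the example grammar. The item representation, the accept rule acc -> S with end as its lookahead, and the helper names are all my own assumptions, and the state and rule numbering may come out in a different order than in the table above:

GRAMMAR = [
    ("acc", ("S",)),            # rule 0: the accept rule, lookahead end
    ("S",   ("N", "+", "N")),
    ("N",   ("1",)),
    ("N",   ("2",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def first(seq):
    # Terminal symbols a sequence may begin with. Good enough here
    # because the grammar has no empty rules and no left recursion.
    if not seq:
        return set()
    if seq[0] not in NONTERMINALS:
        return {seq[0]}
    out = set()
    for lhs, rhs in GRAMMAR:
        if lhs == seq[0] and rhs[0] != seq[0]:
            out |= first(rhs)
    return out

def closure(items):
    # An item is (rule number, dot position, lookahead terminal).
    items = set(items)
    todo = list(items)
    while todo:
        r, dot, la = todo.pop()
        rhs = GRAMMAR[r][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            lookaheads = first(rhs[dot + 1:]) or {la}
            for r2, (lhs2, _) in enumerate(GRAMMAR):
                if lhs2 == rhs[dot]:
                    for la2 in lookaheads:
                        item = (r2, 0, la2)
                        if item not in items:
                            items.add(item)
                            todo.append(item)
    return frozenset(items)

def group(items):
    # Group items by the symbol after the dot, advancing the dot.
    buckets = {}
    for r, dot, la in items:
        rhs = GRAMMAR[r][1]
        if dot < len(rhs):
            buckets.setdefault(rhs[dot], set()).add((r, dot + 1, la))
    return buckets

def fill(cells, key, value):
    # Filling an occupied cell with something else is a conflict.
    if key in cells and cells[key] != value:
        print("conflict on", key, ":", cells[key], "vs", value)
    cells[key] = value

def build():
    states = [closure({(0, 0, "end")})]      # kernel: acc -> . S {end}
    action, goto = [], []
    i = 0
    while i < len(states):
        action.append({})
        goto.append({})
        for r, dot, la in states[i]:
            if dot == len(GRAMMAR[r][1]):    # completed item: reduce or accept
                fill(action[i], la, "acc" if r == 0 else ("r", r))
        for sym, kernel in group(states[i]).items():
            state = closure(kernel)
            if state not in states:          # match: have we seen this state?
                states.append(state)
            j = states.index(state)
            if sym in NONTERMINALS:
                goto[i][sym] = j             # nonterminals go into the goto part
            else:
                fill(action[i], sym, ("s", j))
        i += 1
    return action, goto

action, goto = build()
print(len(action), "states")                 # nine states for this grammar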

Finally

I still need more experience with LR parsers. For example, generating the parsing tables for my language seems to take seconds on my Intel Q9550. I'm not sure whether generating parsing tables simply consumes a lot of time, or whether I've done something wrong.

To some extent, restricting my programming language's grammar to LR(1) seems to expose its most unreadable structures, forcing me to locate and cover the corner cases. So far my parser generator just reports shift/reduce and reduce/reduce conflicts across the whole table. I guess improving its error messages might let me offload finding the corner cases to the parser generator.

For someone learning the language, it might be good that the parser analyses the code smallest piece first, from left to right. It matches what happens when you're concentrating on extracting meaning.

On the other hand, if one wants to parse the language the way it is read, it might be useful to study focals, since experienced readers may rely on them. A focal is a unit of sight: a pair of parentheses forms a focal, as do other symbols and keywords. Different focals have different strengths, and whatever surrounds them defines what is there.
