How to write your own LR(1) parsing tables (and generate them)

There are a few reasons why you might consider rolling your own parsing tables.

  1. LR parsers are structurally quite simple and easy to understand.
  2. It may help you figure out what's going wrong if you ever use a parser generator.
  3. It can be quite an easy way to get a parser if you don't otherwise want to use any parsing tools in your programming environment.

Warning: I found some bugs in the code; there's corrected code in the next post.

Prerequisites

To follow through here, you are going to need...

  1. A text editor
  2. A grammar. I am going to use an interesting fragment of Ratstail91's Toy language because he supplied a context-free grammar for it.

That's all you need. Though if you want to implement a parser with your parsing tables, you also need a programming language and tools to run programs in it. We will get to that later.

What is a context-free grammar?

We work with context-free grammars and I assume you already know what they are, but if you don't, here's a short explanation.

A context-free grammar is a bunch of rules for forming sentences. The way they work is that a symbol may stand for a group of other symbols. For example, if the grammar contains:

breakStmt → "break" ";"

This means that if we need to produce a 'breakStmt', we can form it from "break" and ";" -symbols.

The author of the grammar has helpfully separated the lexemes of the language by wrapping them in quotation marks. These are the small units that make up the actual sentence. It's not always necessary to separate the lexemes from the other symbols, but an LR(1) parser benefits from knowing which symbols are lexemes.

Additionally the grammar uses some small symbols to express the production rules and the language succinctly. We have stars, question marks, parentheses wrapped around symbols, and vertical bars. These have fairly standard meanings and I will explain them as we go.
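
For example, a sugared rule along the lines of (hypothetical, but matching the rewrite done later in this post):

program → declaration*

expands into two plain production rules:

program →
program → program declaration

And the vertical bar, as in declaration → varDecl | constDecl, is shorthand for writing several production rules with the same left-hand side.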

What is a LR parser?

An LR parser is a program that recognizes and parses such sentences as the ones presented above. I'd assume you know this too.

An LR parser is constructed from a pushdown automaton. A pushdown automaton is a finite-state machine that employs a stack.

An LR parser does either a shift or a reduce depending on the current state and input.

  1. Shift takes a symbol from the input stream, pushes the symbol and the current state onto the stack, then advances to another state.
  2. Reduce pops 'n' items from the stack, restores the state recorded in the stack, then puts a new symbol at the front of the input stream.

It may sound very remote from what you'd think is involved in recognizing sentences of a language, but it is sufficient for some grammars.
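
To make this concrete, here is a hand-traced run recognizing "break" ";" with the breakStmt rule from above (the state numbers are made up for this illustration):

stack    state  input          action
[]       0      "break" ";" $  shift "break", go to state 1
[0]      1      ";" $          shift ";", go to state 2
[0, 1]   2      $              reduce: pop 2 items, return to state 0,
                               put breakStmt at the front of the input
[]       0      breakStmt $    shift breakStmt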

If you can take your grammar and construct a decision table for the LR parser, then you have a grammar that can be parsed with an LR parser. If you can't, you will run into shift/reduce and reduce/reduce conflicts. I will explain those in a moment.

Dotted rules, or "items"

There's a way to split the parsing of context-free grammars into smaller steps. This happens by tracking where we "are" in the grammar during parsing. We can accomplish this by moving dots inside the production rules of the grammar. When you start parsing some rule, you give it a dot before the first symbol you expect to see, like this:

breakStmt → ∘ "break" ";"

If we've already seen the "break" -symbol in the input, the dot has moved to the right side of it:

breakStmt → "break" ∘ ";"

And if Thomas has seen everything from the crime scene, planning for a swift escape through the rail system, we have:

breakStmt → "break" ";" ∘

This kind of item is usually called a "reduction" item, whereas the kind where the dot is in the first position is called a "prediction" item. Of course, you may also start parsing from the right side, in which case these terms are flipped around and you get different parsing tables.

The main event

I've explained what we are working with, so it's time to get to the main event of this post. Let's create those decision tables.

We start with a prediction item for what we're expecting to parse:

⊤ → ∘ program

Then you look at the thing on the right side of the dot. If there's a rule to produce it, you predict that rule is needed and add it under the item:

0: ⊤ → ∘ program
   program → ∘
   program → ∘ program declaration

The program-rule had a star in it; this commonly means the item the star is attached to repeats zero or more times. I rewrote the star away from the grammar and got the above two rules. Note that although the second item is empty, we don't need to do anything special here. The first set of items is complete, and now we shift with each symbol that appears on the right side of a dot, ending up with this:

1: ⊤ → program ∘
   program → program ∘ declaration

Each item set is numbered: the first set was 0, this one is 1. Now we have to predict that we parse declarations here:

1: ⊤ → program ∘
   program → program ∘ declaration 
   declaration → ∘ varDecl
   declaration → ∘ constDecl
   declaration → ∘ statement

Hopefully you now understand what the vertical bars mean in the grammar. They are a convenience for describing multiple production rules at once.

We would repeat the prediction for every symbol we have. For now I leave it at that; let's assume 'varDecl', 'constDecl' and 'statement' are our lexemes.

There are a bunch of symbols we can shift with. Shifting each produces the following further item sets:

2: program → program declaration ∘

3: declaration → varDecl ∘

4: declaration → constDecl ∘

5: declaration → statement ∘

If shifting produces initial item sets identical to ones we already have, we may collapse them together. This means multiple shifts can eventually lead into the same item set. By the way, record those shifts:

0: program:     Shift 1
1: declaration: Shift 2
   varDecl:     Shift 3
   constDecl:   Shift 4
   statement:   Shift 5
2:
3:
4:
5:

That is actually the shifting part of the decision table. Next we need to consider the reduction part. We can retrieve the following reduction items from the item sets:

0: program → ∘
1: ⊤ → program ∘
2: program → program declaration ∘
3: declaration → varDecl ∘
4: declaration → constDecl ∘
5: declaration → statement ∘

There may be multiple reduction items in an item set.

If we can't determine which rule to reduce with, we get a reduce/reduce conflict. Likewise if we can't determine whether to reduce or shift, we get a shift/reduce conflict.

Now if we left it at this, we'd have shift/reduce conflicts in item sets 0 and 1. Otherwise the table would look like this:

0: program:     Shift 1 / Reduce 0 program
1: declaration: Shift 2 / Reduce 1 ⊤ 
   varDecl:     Shift 3 / Reduce 1 ⊤ 
   constDecl:   Shift 4 / Reduce 1 ⊤ 
   statement:   Shift 5 / Reduce 1 ⊤ 
2: Reduce 2 program
3: Reduce 1 declaration
4: Reduce 1 declaration
5: Reduce 1 declaration

The 'Reduce' -record contains how many items to pop and the symbol to shift with afterwards. For example, 'Reduce 2 program' pops two states off the stack and then acts as if 'program' were the next input symbol. In practice you might also add the number of the reduction rule being applied, but this is sufficient.

Next we can try to get rid of the shift/reduce issues. This happens by constructing FIRST/FOLLOW -sets for the conflicts. Basically, the point is to find out whether the reductions leave room in the table for the shifts.

FIRST(symbol) -set

The first task is to determine which lexemes each symbol may start with. We would conclude the following:

FIRST(⊤) = {varDecl,constDecl,statement}
FIRST(program) = {varDecl,constDecl,statement}
FIRST(declaration) = {varDecl,constDecl,statement}
FIRST(varDecl) = {varDecl}
FIRST(constDecl) = {constDecl}
FIRST(statement) = {statement}

If you don't know which symbols are lexemes appearing in the input stream, you have to assume all of them are.

FOLLOW(itemset, symbol) -set

Second, we have to trace which lexemes can appear after each symbol once it has been shifted. In the very first item set, we have declarations, and possibly nothing, appearing after a program symbol.

0: ⊤ → ∘ program
   program → ∘
   program → ∘ program declaration

And therefore:

FOLLOW(0, program) = {Ø1,varDecl,constDecl,statement}

Note we have marked that there can be nothing following a program (the Ø), and which item in the set it came from; those are details we need to know as well.

Here are the remaining pieces of the 'FOLLOW' -set.

FOLLOW(1, declaration) = {Ø1}
FOLLOW(1, varDecl) = {Ø2}
FOLLOW(1, constDecl) = {Ø3}
FOLLOW(1, statement) = {Ø4}

And that's all we have for the small fragment of grammar we picked.

Resolving the decision table

Now we have the information we need to resolve the conflicts in the earlier table. We start by declaring that after the topmost rule we expect to see the input stream end:

0: ⊤ → ∘ program {$}
   program → ∘ {$,varDecl,constDecl,statement}
   program → ∘ program declaration {$,varDecl,constDecl,statement}

The sets at the end of each item are lookahead sets and describe what is expected after the rule. There is a bit of shenanigans going on here because we have to predict that the program rule follows itself: the left recursion puts FIRST(declaration) into program's own lookahead set.

Now we can conclude the first state is:

0: $:           Reduce 0 program
   varDecl:     Reduce 0 program
   constDecl:   Reduce 0 program
   statement:   Reduce 0 program
   program:     Shift 1

We do the shift like before:

1: ⊤ → program ∘ {$}
   program → program ∘ declaration {$,varDecl,constDecl,statement}
   declaration → ∘ varDecl {$,varDecl,constDecl,statement}
   declaration → ∘ constDecl {$,varDecl,constDecl,statement}
   declaration → ∘ statement {$,varDecl,constDecl,statement}

Note the predictions are done the same way as before, except that they get their own lookahead sets. Now we know that reductions do not collide with shifting anywhere:

1: $:           Reduce 1 ⊤ 
   declaration: Shift 2
   varDecl:     Shift 3
   constDecl:   Shift 4
   statement:   Shift 5

And we can do the same as before:

2: program → program declaration ∘ {$,varDecl,constDecl,statement}
3: declaration → varDecl ∘ {$,varDecl,constDecl,statement}
4: declaration → constDecl ∘ {$,varDecl,constDecl,statement}
5: declaration → statement ∘ {$,varDecl,constDecl,statement}

This lets us conclude that the accepting reduction, 'Reduce 1 ⊤', is only needed at the end of the input stream.

Since we shift to state 1 with program, we'd like to know what to expect after the 'program'. This is revealed by FOLLOW(0, program): it tells us we have the 'nothing follows' marker and the lexemes that start a declaration. The 'nothing follows' marker is substituted with the lookahead set of the current state.

0: $:           Reduce 0 program
   varDecl:     Reduce 0 program
   constDecl:   Reduce 0 program
   statement:   Reduce 0 program
   program:     Shift 1
1: $:           Reduce 1 ⊤ 
   declaration: Shift 2
   varDecl:     Shift 3
   constDecl:   Shift 4
   statement:   Shift 5
2: $:           Reduce 2 program
   varDecl:     Reduce 2 program
   constDecl:   Reduce 2 program
   statement:   Reduce 2 program
3: $:           Reduce 1 declaration
   varDecl:     Reduce 1 declaration
   constDecl:   Reduce 1 declaration
   statement:   Reduce 1 declaration
4: $:           Reduce 1 declaration
   varDecl:     Reduce 1 declaration
   constDecl:   Reduce 1 declaration
   statement:   Reduce 1 declaration
5: $:           Reduce 1 declaration
   varDecl:     Reduce 1 declaration
   constDecl:   Reduce 1 declaration
   statement:   Reduce 1 declaration

There you have your parser's decision table. This concludes the main content of this guide.

Okay, how to build a parser around that?

To make a parser, we need a bit of machinery, and then we have to encode the table in the language that we use:

class State:
    def __init__(self):
        self.stack = []   # the states we have passed through
        self.top = 0      # the current state

def advance(st, next_symbol):
    # 'state' is the decision table defined below. Reduce actions return
    # True because the same input symbol must be examined again afterwards.
    while state[st.top][next_symbol](st):
        pass

def shift(to):
    def _shift_(st):
        print('shift {}'.format(to))
        st.stack.append(st.top)
        st.top = to
        return False
    return _shift_

def red(count, symbol):
    def _red_(st):
        # Pop 'count' items to restore the state recorded in the stack,
        # then act as if 'symbol' were the next input symbol.
        for _ in range(count):
            st.top = st.stack.pop()
        print('reduce {}'.format(symbol))
        state[st.top][symbol](st)
        return True
    return _red_

def accept():
    def _accept_(st):
        st.top = st.stack.pop()
        print('accepted')
        return False
    return _accept_

state = [
    {'$':           red(0, 'program'),
     'varDecl':     red(0, 'program'),
     'constDecl':   red(0, 'program'),
     'statement':   red(0, 'program'),
     'program':     shift(1)},
    {'$':           accept(),
     'declaration': shift(2),
     'varDecl':     shift(3),
     'constDecl':   shift(4),
     'statement':   shift(5)},
    {'$':           red(2, 'program'),
     'varDecl':     red(2, 'program'),
     'constDecl':   red(2, 'program'),
     'statement':   red(2, 'program')},
    {'$':           red(1, 'declaration'),
     'varDecl':     red(1, 'declaration'),
     'constDecl':   red(1, 'declaration'),
     'statement':   red(1, 'declaration')},
    {'$':           red(1, 'declaration'),
     'varDecl':     red(1, 'declaration'),
     'constDecl':   red(1, 'declaration'),
     'statement':   red(1, 'declaration')},
    {'$':           red(1, 'declaration'),
     'varDecl':     red(1, 'declaration'),
     'constDecl':   red(1, 'declaration'),
     'statement':   red(1, 'declaration')}]

Now you have a recognizer; you can recognize sentences in your language like this:

# Successful parse
st = State()
advance(st, 'statement')
advance(st, 'varDecl')
advance(st, '$')
print(st.stack)
print(st.top)

# Failing parse
st = State()
advance(st, 'statement')
advance(st, 'foo')
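
For reference, the successful parse prints the trace below and finishes with an empty stack, back in state 0. The failing parse raises a KeyError at 'foo', since state 5 has no entry for that symbol:

reduce program
shift 1
shift 5
reduce declaration
shift 2
reduce program
shift 1
shift 3
reduce declaration
shift 2
reduce program
shift 1
accepted
[]
0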

Route the output of your lexical analyzer into 'advance' and you are recognizing strings. To turn the recognizer into a parser, you add a routine that builds values within the 'shift' and 'reduce' -steps.

How would it look then? Like this:

class State:
    def __init__(self):
        self.stack = []
        self.forest = []
        self.top = 0

def advance(st, next_symbol, value):
    while state[st.top][next_symbol](st, value):
        pass

def shift(to):
    def _shift_(st, value):
        st.stack.append(st.top)
        st.forest.append(value)
        st.top = to
        return False
    return _shift_

def red(count, symbol, name):
    def _red_(st, value):
        args = []
        for _ in range(count):
            st.top = st.stack.pop()
            args.append(st.forest.pop())
        args.append(":" + name)  # tag the node with the rule's name
        args.reverse()           # children were popped off in reverse order
        state[st.top][symbol](st, args)
        return True
    return _red_

def accept():
    def _accept_(st, value):
        st.top = st.stack.pop()
        item = st.forest.pop()
        print('accepted {}'.format(item))
        return False
    return _accept_

state = [
    {'$':           red(0, 'program', 'empty'),
     'varDecl':     red(0, 'program', 'empty'),
     'constDecl':   red(0, 'program', 'empty'),
     'statement':   red(0, 'program', 'empty'),
     'program':     shift(1)},
    {'$':           accept(),
     'declaration': shift(2),
     'varDecl':     shift(3),
     'constDecl':   shift(4),
     'statement':   shift(5)},
    {'$':           red(2, 'program', 'sequence'),
     'varDecl':     red(2, 'program', 'sequence'),
     'constDecl':   red(2, 'program', 'sequence'),
     'statement':   red(2, 'program', 'sequence')},
    {'$':           red(1, 'declaration', 'var'),
     'varDecl':     red(1, 'declaration', 'var'),
     'constDecl':   red(1, 'declaration', 'var'),
     'statement':   red(1, 'declaration', 'var')},
    {'$':           red(1, 'declaration', 'const'),
     'varDecl':     red(1, 'declaration', 'const'),
     'constDecl':   red(1, 'declaration', 'const'),
     'statement':   red(1, 'declaration', 'const')},
    {'$':           red(1, 'declaration', 'statement'),
     'varDecl':     red(1, 'declaration', 'statement'),
     'constDecl':   red(1, 'declaration', 'statement'),
     'statement':   red(1, 'declaration', 'statement')}]

# Successful parse
st = State()
advance(st, 'statement', '123 = 4')
advance(st, 'varDecl', 'var abc')
advance(st, '$', None)
print(st.stack)
print(st.top)

# Failing parse
st = State()
advance(st, 'statement', '123 = 4')
advance(st, 'foo', 'foo')
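
For reference, the accept step now prints the tree built from the ':name' -tagged lists:

accepted [':sequence', [':sequence', [':empty'], [':statement', '123 = 4']], [':var', 'var abc']]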

Likewise adding line/column numbering isn't hard.
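
Here is a minimal sketch of one way to do it; this is my own addition rather than the post's code. Carry a (start, stop) span alongside each value and merge the spans in 'red'; 'accept' gets the extra parameters the same way:

def advance(st, next_symbol, value, start, stop):
    while state[st.top][next_symbol](st, value, start, stop):
        pass

def shift(to):
    def _shift_(st, value, start, stop):
        st.stack.append(st.top)
        st.forest.append((value, start, stop))  # keep the source span with the value
        st.top = to
        return False
    return _shift_

def red(count, symbol, name):
    def _red_(st, value, start, stop):
        args = []
        for _ in range(count):
            st.top = st.stack.pop()
            args.append(st.forest.pop())
        args.reverse()
        # A reduced node spans from its first child to its last child;
        # an empty reduction gets a zero-width span at the current position.
        rstart = args[0][1] if args else start
        rstop = args[-1][2] if args else start
        state[st.top][symbol](st, [":" + name] + args, rstart, rstop)
        return True
    return _red_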

Layout sensitivity?

So after you read this, if you've been a fan of Python, you are going to ask how to get layout sensitivity. The off-side rule, indentation syntax: it's got many names.

If you can formally model your language, you don't depend on any purpose-built tools. So first I'll describe a working way to model a layout-sensitive language with context-free grammars.

So, a bit of theory first. When layout has syntactic meaning, how would you describe what layout is? In the simplest sense of the word, I think it means that the "box", or shape, of the sentence is interpreted as structure. For example, if you have an addition spread across two lines:

1234
 + 2345

You could imagine this forms a box that encloses this expression. Likewise every lexeme forms its own box that encloses the lexeme.

You are going to need a special lexeme labeled 'VALIGN'. This lexeme is recognized if the adjacent symbols align vertically.

Next we are going to need a modifier 'WFB'. This denotes that a symbol coming after it must form a well-formed-boundary. The idea with a well-formed-boundary is that the leftmost symbol is the only symbol that hits the left boundary of the box that encloses the expression.

Now we can describe a statement block like this:

block → WFB statement
block → block VALIGN WFB statement

You could define different kinds of constraints in the same way. It'd be sensible to require syntactic elements to right-align, or to align in a square, or just about anything you can think of. These two constraints, though, have a convenient implementation in an LR parser.

The WFB can be implemented by inserting a layout constraint while shifting. Implement it right and the reduction step erases the constraint when it no longer applies. Likewise, you cloak the lookahead set behind the WFB so that no conflicts appear that would not be there without the layout.

The 'VALIGN' can be interpreted as a prefix for the symbol following it. You ignore it in the lookahead-sets and mark a constraint on the shift: "the item shifted in must align vertically with previously shifted item".

The layout sensitivity introduces two possible new kinds of conflicts: WFB and VALIGN -conflicts. These happen when shifted items do not align or constrain like they should.

That's all; you can make your language layout-sensitive this way. If you don't believe it, I'll show you here, with a language that has a block of some statements:

0: block → ∘ WFB statement {$}
   block → ∘ block VALIGN WFB statement {$}
   statement → ∘ identifier eq expr [wfb] {wfb,$}
   statement → ∘ expr [wfb] {wfb,$}
1: block → WFB statement ∘ {$}
2: block → block ∘ VALIGN WFB statement {$}
   statement → ∘ identifier eq expr [valign,wfb] {wfb,$}
   statement → ∘ expr [valign,wfb] {wfb,$}
3: block → block VALIGN WFB statement ∘ {$}
4: statement → identifier ∘ eq expr {wfb,$}
5: statement → identifier eq ∘ expr {wfb,$}
6: statement → identifier eq expr ∘ {wfb,$}
7: statement → expr ∘ {wfb,$}

Note that 'wfb' conceals the lookahead of 'statement' because of the VALIGN. Otherwise it would not.

Ok, next the decision table:

0: statement:  Shift 1
   block:      Shift 2
   identifier: WFB Shift 4
   expr:       WFB Shift 7
1: $:          Reduce 1 block
2: statement:  Shift 3
   identifier: VALIGN WFB Shift 4
   expr:       VALIGN WFB Shift 7
3: $:          Reduce 2 block
4: eq:         Shift 5
5: expr:       Shift 6
6: wfb:        Reduce 3 statement
   $:          Reduce 3 statement
7: wfb:        Reduce 1 statement
   $:          Reduce 1 statement

There you have it. With this you can build indented languages where the block statement can be inlined. For example:

sub loader:
  element.addListener('click',
    sub x, y: console.log(x)
              console.log(y))

Also, you can parse with those transition tables even for grammars without end markers.

A tokenizer

I know somebody will get excited about building a programming language. Therefore I give you a short snippet that shows how to write a tokenizer. It is a finite-state machine, and when you plug 'advance' into the place of 'on_output' and shape it a bit, you can attach this tokenizer to the parser.

Just like before, sticking to finite-state machines keeps things formal enough that you end up with a properly defined language.

class Tokenizer:
    def __init__(self, on_output):
        self.on_output = on_output
        self.state = 'st_0'
        self.column = 1
        self.line   = 1
        self.pos = (1,1)
        self.inp = []

def st_0(tok, ch):
    if ch.isdigit():
        tok.pos = (tok.column, tok.line)
        tok.inp.append(ch)
        tok.state = 'st_digits'
    elif ch.isalpha() or ch == "_":
        tok.pos = (tok.column, tok.line)
        tok.inp.append(ch)
        tok.state = 'st_word'
    elif ch == " " or ch == "\n" or ch == "\t" or ch == "\r":
        pass
    else:
        tok.on_output('error', ch,
            (tok.column, tok.line), (tok.column+1, tok.line))

def st_word(tok, ch):
    if ch.isalpha() or ch == "_" or ch.isdigit():
        tok.inp.append(ch)
    else:
        tok.on_output('word', "".join(tok.inp),
            tok.pos, (tok.column+1, tok.line))
        tok.inp = []
        tok.state = 'st_0'
        st_0(tok, ch)

def st_digits(tok, ch):
    if ch.isdigit():
        tok.inp.append(ch)
    else:
        tok.on_output('digits', "".join(tok.inp),
            tok.pos, (tok.column+1, tok.line))
        tok.inp = []
        tok.state = 'st_0'
        st_0(tok, ch)

# Collects every 'st_' into a dictionary.
tokenize_n = dict((k,v) for k,v in globals().items()
    if k.startswith('st_'))

def tokenize(tok, ch):
    tokenize_n[tok.state](tok, ch)
    if ch == "\n":
        tok.line += 1
        tok.column = 1
    else:
        tok.column += 1

Here's how you use this:

def print_on_output(item, text, start, stop):
    print((item, repr(text), start, stop))

tok = Tokenizer(print_on_output)
with open("input-file", "r") as fd:
    for ch in fd.read():
        tokenize(tok, ch)
tokenize(tok, "\n")  # flush the final token if the file doesn't end in whitespace

Note that if you allow the parser to trigger multiple times and connect 'accept' to a responder of some kind, this lets you build communication protocols with this thing.
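
Here is a sketch of that plugging, assuming the value-building parser from earlier; the keyword table and the choice to treat remaining words as 'statement' lexemes are made up purely for illustration:

# Hypothetical glue between the tokenizer and the parser's 'advance'.
keywords = {'var': 'varDecl', 'const': 'constDecl'}  # made-up keyword table

def feed_parser(item, text, start, stop):
    if item == 'error':
        print('lexical error at {}: {}'.format(start, repr(text)))
    else:
        advance(st, keywords.get(text, 'statement'), text)

st = State()
tok = Tokenizer(feed_parser)
# ... feed characters with tokenize(tok, ch), then finish with:
# tokenize(tok, "\n")        # flush the last token
# advance(st, '$', None)     # send the end-of-input marker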

A little advice on error recovery

There's one piece of generally good advice on writing language interfaces that I'd like to tell you about: don't let the parser stop at the first error.

You see, it's a good idea to report an error and not let the program proceed past the point where the error happened. But it's generally a bad idea to have the program stop there and abandon all the remaining work.

By the time you get that first error, there's the subsequent question of what else went wrong.

The layout parser presented earlier helps especially here. It precisely marks out points where you can bail out and expect the program to follow the same shape once it's been fixed.
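
As a crude sketch of that advice (my own addition, with made-up recovery behavior): catch the failed table lookup, report it, and drop the offending token instead of halting, so later errors still get reported:

def advance(st, next_symbol, value):
    try:
        while state[st.top][next_symbol](st, value):
            pass
    except KeyError:
        # Report and skip the token; real recovery would also resynchronize,
        # for example at the layout boundaries described above.
        print('syntax error: unexpected {}'.format(next_symbol))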

Computer 'assisted' LR decision table generation

Now let's generate another decision table, except that this time we let the computer do the bulk of the work.

Let's use the same grammar as before:

⊤ → program
program → 
program → program declaration 
declaration → varDecl
declaration → constDecl
declaration → statement

lexemes: varDecl,constDecl,statement

We can represent the whole grammar like this:

grammar = [
    (None, ['program']),
    ('program', []),
    ('program', ['program', 'declaration']),
    ('declaration', ['varDecl']),
    ('declaration', ['constDecl']),
    ('declaration', ['statement']),
]
lexemes = ['varDecl', 'constDecl', 'statement']

The dotted rules are pairs of indices because those are easy to work with.
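
For example, the pair (2, 1) stands for the item program → program ∘ declaration: rule number 2 in the list above, with the dot placed after the first symbol of its right-hand side.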

def print_grammar():
    for lhs, rhs in grammar:
        print("{} → {}".format(lhs or '⊤', " ".join(rhs)))

def print_item(prefix, item):
    rule, index = item
    lhs, rhs = grammar[rule]
    print("{}{} → {}".format(prefix, lhs or '⊤',
        " ".join(rhs[:index] + ['∘'] + rhs[index:])))

def print_itemset(index, items):
    prefix = "{}: ".format(index)
    for item in items:
        print_item(prefix, item)
        prefix = " " * len(prefix)

Now if we wanted to print the first itemset, it'd look like this:

print_itemset(0, set([
    (0,0),
    (1,0),
    (2,0),
]))

Implementation of the prediction step:

def after_dot(item):
    rule, index = item
    lhs, rhs = grammar[rule]
    if index < len(rhs):
        return rhs[index]

def predict(items):
    prediction = set(items)
    p = len(prediction)
    while len(items) > 0:
        sym = after_dot(items.pop())
        for index, (lhs,rhs) in enumerate(grammar):
            if sym == lhs and sym is not None:
                prediction.add((index, 0))
                if p < len(prediction):
                    p = len(prediction)
                    items.append((index,0))
    return prediction

To be honest, I didn't get that right away; the "and sym is not None" part was added while writing this post. And now we test it:

print_itemset(0, predict([(0,0)]))

You get the whole thing in a bit of an odd order because sets are unordered, but it's the initial itemset from earlier.

0: program → ∘ program declaration
   program → ∘
   ⊤ → ∘ program

Now we can partition it into shifting sets:

def partition(items):
    groups = {}
    for item in items:
        sym = after_dot(item)
        if sym is not None:
            item = (item[0], item[1]+1)
        try:
            groups[sym].append(item)
        except KeyError as _:
            groups[sym] = [item]
    return [(sym, frozenset(items)) for sym,items in groups.items()]

And again try things out:

for sym, items in partition(predict([(0,0)])):
    print_itemset(sym, items)

And we have some rudimentary shifting sets here.

program: ⊤ → program ∘
         program → program ∘ declaration
None: program → ∘

Now we can compute all itemsets:

itemsets = [ frozenset([(0,0)]) ]
itemsets_index = dict((s,i) for i,s in enumerate(itemsets))
vectors = []
full_itemsets = []
shifts = []
reductions = []
# Note: 'itemsets' grows while we iterate, so 'enumerate' also walks
# through the item sets discovered along the way.
for k, itemset in enumerate(itemsets):
    vectors.append(tuple(itemset))
    pset = predict(list(itemset))
    full_itemsets.append(pset)
    print_itemset(k, itemset)
    k_shifts = {}
    k_reductions = set()
    for sym, items in partition(pset):
        if sym is None:
            k_reductions.update(items)
        else:
            try:
                j = itemsets_index[items]
            except KeyError as _:
                j = len(itemsets)
                itemsets_index[items] = j
                itemsets.append(items)
            k_shifts[sym] = j
    shifts.append(k_shifts)
    reductions.append(k_reductions)
print(shifts)
print(reductions)

The printout is the same as what we did before by hand:

0: program → ∘ program declaration
   program → ∘
   ⊤ → ∘ program
1: ⊤ → program ∘
   declaration → ∘ varDecl
   declaration → ∘ statement
   program → program ∘ declaration
   declaration → ∘ constDecl
2: declaration → constDecl ∘
3: declaration → varDecl ∘
4: declaration → statement ∘
5: program → program declaration ∘
[{'program': 1},
 {'varDecl': 3, 'constDecl': 2, 'statement': 4, 'declaration': 5},
 {}, {}, {}, {}]
[set([(1, 0)]),
 set([(0, 1)]),
 set([(4, 1)]),
 set([(3, 1)]),
 set([(5, 1)]),
 set([(2, 2)])]

We see the shifts and reductions correspond with what we wrote by hand. Now we have the itemsets resolved.

Computation of EMPTY -sets

This is something I didn't cover, but it's needed for computing the FIRST and FOLLOW sets: the EMPTY set contains every symbol that may derive an empty string.

def empty_symbols():
    symbols = set()
    for lhs,rhs in grammar:
        if len(rhs) == 0:
            symbols.add(lhs)
    m = 0
    n = len(symbols)
    while m < n:
        for lhs, rhs in grammar:
            if all(x in symbols for x in rhs):
                symbols.add(lhs)
        m = n
        n = len(symbols)
    return symbols
empty = empty_symbols()

Empty symbols aren't common in grammars, but they add a bit of extra work.
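
For our grammar, print(empty) shows set(['program', None]): 'program' has an empty production, and consequently ⊤ (represented by None) may derive nothing as well.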

Computation of FIRST -sets

Construction of the 'FIRST' sets was not the simplest thing either. It took me a few tries to get it right.

def first_lexemes():
    symbols = dict()
    routes = set()
    for sym in lexemes:
        symbols[sym] = set([sym])
    for lhs, rhs in grammar:
        if lhs not in symbols:
            symbols[lhs] = set([])
    for lhs, rhs in grammar:
        for rhsN in rhs:
            routes.add((lhs,rhsN))
            if rhsN not in empty:
                break
    rep = True
    while rep:
        rep = False
        for lhs, rhs0 in routes:
            n = len(symbols[lhs])
            symbols[lhs].update(symbols[rhs0])
            rep |= n < len(symbols[lhs])
    return symbols

first = first_lexemes()
print(first)

The result is what you'd expect by now. A bit of a jumble, but it has the gist:

{'statement': set(['statement']),
 'varDecl': set(['varDecl']),
 'program': set(['statement', 'varDecl', 'constDecl']),
 None: set(['statement', 'constDecl', 'varDecl']),
 'declaration': set(['varDecl', 'constDecl', 'statement']),
 'constDecl': set(['constDecl'])}

Next the follow set for each itemset.

Computation of FOLLOW -sets

The FOLLOW -sets record what can follow each symbol within an itemset. This lets us find out how the predicted items should behave.

def follow_lexemes(seedset, full_itemset):
    symbols = {}
    seeds = {}
    routes = set()
    for item in full_itemset:
        sym0 = after_dot(item)
        if sym0 not in symbols:
            symbols[sym0] = set()
            seeds[sym0] = set()
    for rule,index in full_itemset:
        lhs,rhs = grammar[rule]
        if index < len(rhs):
            rhs0 = rhs[index]
            # Collect the lexemes that may appear right after rhs0, walking
            # forward for as long as the symbols can derive an empty string.
            k = index+1
            while k < len(rhs):
                symbols[rhs0].update(first[rhs[k]])
                if rhs[k] not in empty:
                    break
                k += 1
            if k == len(rhs):
                # rhs0 may end this rule, so the rule's own follow-up
                # information flows into rhs0.
                if (rule,index) in seedset:
                    seeds[rhs0].add((rule,index))
                else:
                    routes.add((lhs, rhs0))
    rep = True
    while rep:
        rep = False
        for lhs, sym in routes:
            n = len(symbols[sym])
            symbols[sym].update(symbols[lhs])
            rep |= n < len(symbols[sym])
            n = len(seeds[sym])
            seeds[sym].update(seeds[lhs])
            rep |= n < len(seeds[sym])
    return symbols, seeds

And then we try it out:

follow_syms = []
follow_seeds = []

for i in range(len(itemsets)):
    syms,seeds = follow_lexemes(itemsets[i], full_itemsets[i])
    follow_syms.append(syms)
    follow_seeds.append(seeds)
    print(i)
    print(syms)
    print(seeds)

These sets are quite barren in the grammar that we test with. If I made a mistake anywhere here, I guess it's not visible yet.

0
{'program': set(['constDecl', 'varDecl', 'statement']), None: set([])}
{'program': set([(0, 0)]), None: set([])}
1
{'constDecl': set([]), 'varDecl': set([]), 'statement': set([]),
 None: set([]), 'declaration': set([])}
{'constDecl': set([(2, 1)]), 'varDecl': set([(2, 1)]), 'statement': set([(2, 1)]),
 None: set([]), 'declaration': set([(2, 1)])}
2
{None: set([])}
{None: set([])}
3
{None: set([])}
{None: set([])}
4
{None: set([])}
{None: set([])}
5
{None: set([])}
{None: set([])}

We've gone through what this should mean, so it should be possible to weed out any bugs by inspecting the results once they're there.

Build-up of the LR(1) decision tables

If you were reading through the snippets, you may be wondering what we are going to do with the 'vectors' -structure. Or then you're really smart and figured it out already.

We are going to use the lookahead sets as indices into a 'function' that builds the full decision table.
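
One caveat: the snippet below calls a 'followup' helper that is not shown in the post; its definition lives in the attached parsergen.py. Judging from the call sites, it computes the lookahead set of an item in itemset k from the seed items' lookaheads. Here is my reconstruction of it, an assumption rather than the author's exact code:

def followup(k, seed_lookahead, item):
    # Seed items carry the lookahead passed in from the parent state.
    if item in seed_lookahead:
        return set(seed_lookahead[item])
    # A predicted item inherits whatever may follow its left-hand side in
    # this itemset, plus the lookaheads of the seed items it defers to.
    lhs = grammar[item[0]][0]
    lookahead = set(follow_syms[k][lhs])
    for seeditem in follow_seeds[k][lhs]:
        lookahead.update(seed_lookahead[seeditem])
    return lookahead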

fin_index = {}
fin_vectors = []
fin_tabs = []
conflicts = {}

def build_decision_table(k, *args):
    fin_index[(k,)+args] = tab_index = len(fin_vectors)
    fin_vectors.append((k,)+args)
    tab = {}
    fin_tabs.append(tab)
    assert len(vectors[k]) == len(args)
    seed_lookahead = dict(zip(vectors[k],args))
    syms = follow_syms[k]
    seeds = follow_seeds[k]
    for sym, j in shifts[k].items():
        # Un-shift each seed item of state j to find its source item here,
        # then compute the lookaheads that travel into state j.
        s_args = (j,) + tuple(
            frozenset(followup(k, seed_lookahead, (s_item[0], s_item[1]-1)))
            for s_item in vectors[j])
        if s_args in fin_index:
            tab[sym] = (0, fin_index[s_args])
        else:
            tab[sym] = (0, build_decision_table(*s_args))
    had_conflicts = []
    for reditem in reductions[k]:
        for sym in followup(k, seed_lookahead, reditem):
            action = ('reduce',
                grammar[reditem[0]][0],
                len(grammar[reditem[0]][1]),
                reditem[0])
            if sym in tab:
                if (k,sym) in conflicts:
                    conflicts[(k,sym)].append(action)
                else:
                    conflicts[(k,sym)] = [tab[sym], action]
                    had_conflicts.append((k,sym))
            else:
                tab[sym] = action
    if len(had_conflicts) > 0:
        print("Conflicts:".format(had_conflicts))
        for cnf in had_conflicts:
            print(" {}: {}".format(cnf, conflicts[cnf]))
    return tab_index

build_decision_table(0, frozenset([None]))
if len(conflicts) > 0:
    print(fin_tabs)
    print(conflicts)
else:
    print(fin_tabs)

Just like I advised for building an actual parser, this thing collects conflicts and presents them as they appear. Time to see how our grammar baked through:

[{'constDecl': ('reduce', 'program', 0, 1),
  'program':   (0, 1),
  'varDecl':   ('reduce', 'program', 0, 1),
  'statement': ('reduce', 'program', 0, 1),
  None:        ('reduce', 'program', 0, 1)},
 {'constDecl':   (0, 3),
  'varDecl':     (0, 2),
  None:          ('reduce', None, 1, 0),
  'statement':   (0, 4),
  'declaration': (0, 5)},
{'varDecl':   ('reduce', 'declaration', 1, 3),
 'constDecl': ('reduce', 'declaration', 1, 3),
 None:        ('reduce', 'declaration', 1, 3),
 'statement': ('reduce', 'declaration', 1, 3)},
{'varDecl':   ('reduce', 'declaration', 1, 4),
 'constDecl': ('reduce', 'declaration', 1, 4),
 None:        ('reduce', 'declaration', 1, 4),
 'statement': ('reduce', 'declaration', 1, 4)},
{'varDecl':   ('reduce', 'declaration', 1, 5),
 'constDecl': ('reduce', 'declaration', 1, 5),
 None:        ('reduce', 'declaration', 1, 5),
 'statement': ('reduce', 'declaration', 1, 5)},
{'varDecl':   ('reduce', 'program', 2, 2),
 'constDecl': ('reduce', 'program', 2, 2),
 None:        ('reduce', 'program', 2, 2),
 'statement': ('reduce', 'program', 2, 2)}]

It looks a lot like it succeeded. In this encoding, a pair (0, j) is a shift to state j, a ('reduce', lhs, count, rule) entry pops 'count' items and then shifts with 'lhs', and None stands in both for the ⊤ symbol and for the end-of-input lexeme '$'.

That WFB/VALIGN -thing?

I didn't get to do it yet, but I know you may be very interested in how to generate decision tables for layout-constrained grammars.

Well, first of all you need the 'wfb_constraints' and 'valign_constraints' -sets. These hold the indices of the rules and items carrying these constraints.

wfb_constraints = [(1,0), (1,1)]
valign_constraints = [(1,1)]

Next you have to check during prediction that every prediction is made with the same WFB/VALIGN guards, and mark the WFB/VALIGN conditions onto the shifted symbols. A bit like this:

def predict(items):
    prediction = set(items)
    wfb = set()
    valign = set()
    cconflicts = []
    p = len(prediction)
    while len(items) > 0:
        this = items.pop()
        has_wfb = this in wfb_constraints or this in wfb
        has_valign = this in valign_constraints or this in valign
        sym = after_dot(this)
        for index, (lhs,rhs) in enumerate(grammar):
            if sym == lhs and sym is not None:
                prediction.add((index, 0))
                if p < len(prediction):
                    p = len(prediction)
                    items.append((index,0))
                    if has_wfb:
                        wfb.add((index,0))
                    if has_valign:
                        valign.add((index,0))
                else:
                    if ((index,0) in wfb) ^ has_wfb:
                        cconflicts.append(('wfb', (index,0)))
                    if ((index,0) in valign) ^ has_valign:
                        cconflicts.append(('valign', (index,0)))
    return prediction, wfb, valign, cconflicts

In the follow-set buildup you detect WFB+VALIGN -guarded clauses.

for rule,index in full_itemset:
    lhs,rhs = grammar[rule]
    has_wfb = (rule,index) in wfb_constraints
    if index < len(rhs):
        rhs0 = rhs[index]
        k = index+1
        while k < len(rhs):
            if (rule,k) in valign_constraints:
                symbols[rhs0].add('wfb')
            else:
                symbols[rhs0].update(first[rhs[k]])
            if rhs[k] not in empty:
                break
            k += 1
        if k == len(rhs):
            if (rule,index) in seedset:
                seeds[rhs0].add((rule,index))
            else:
                routes.add((lhs, rhs0))

Finally when building the decision tables, mark the states with the guards.

def partition(items, wfb, valign, cconflicts):
    groups = {}
    modes = {}
    for item in items:
        sym = after_dot(item)
        mode = (int(item in wfb) << 1) | int(item in valign)
        if sym is not None:
            item = (item[0], item[1]+1)
        try:
            groups[sym].append(item)
            if modes[sym] != mode:
                cconflicts.append(('shift+valign+wfb', sym))
        except KeyError as _:
            groups[sym] = [item]
            modes[sym] = mode
    return [(sym, frozenset(items), modes[sym])
        for sym, items in groups.items()]

I've put the whole code for the generator into the parsergen.py -file and attached it to this blog post.

Anyway, the real point of this post was not to share code snippets, but to illustrate how simple and easy it is to build an LR parser, even by hand. Although the work should be carefully inspected afterwards: make a mistake in the table and it won't parse the right language.

Now's your chance

If you contact me this October, either by email or by sending a message to the reddit post in r/programminglanguages that discusses this blog post, I will help you build a parser for your language, as featured above. And if you like, I will feature your language in this blog, among other people's languages.

It doesn't need to be a programming language or otherwise a serious project. The finished language can be written in any common programming language, as long as you know that language well yourself. Though I will stick to using Python for the parser generator part.

Every submitted language will be described with the following kind of grammar:

grammar: 
grammar: grammar VALIGN WFB declaration
declaration:
    identifier ":" rhs_block
    "lexemes" identifiers
rhs_block:
rhs_block: rhs_block VALIGN WFB rhs
rhs: 
rhs: rhs symbol
symbol: "VALIGN" symbol
        "WFB" symbol
        identifier
        string
identifiers: identifier
             identifiers "," identifier
lexemes identifier, string

You don't need to have this kind of grammar with you when you contact me, but I will make the parser generator use this grammar later this week.

Use cases

Obviously, if you want to build a programming language, you need a parser. But it could be that you just stumbled on this and wonder whether it's useful for you.

Here are examples of some things you can do with a working LR parser:

  1. Some communication protocols are described with context-free grammars. In those cases it's possible to use an LR parser to interpret the protocol.
  2. The C programming language has an LR-specific parser hack where 'typedef' statements insert new type names into the parser's symbol table. A new name takes effect immediately after the semicolon of the typedef statement, and I think it follows scopes as well. So it's not much fun to parse by means other than implementing the LR parser and the syntax hack.
  3. LR parsers can be used in reverse, just like many other parsers. Instead of feeding in symbols, you may gather the symbols the parser expects to see and then present them to the user. This could be useful for somebody who wants to write a game or a graphical editor of some kind, or otherwise implement a different kind of user interface.
