Layout-sensitive LR(1) algorithm analysis

In the previous post I introduced an algorithm for generating layout-sensitive parsers. It turned out the algorithm didn't quite work yet.

This time I take a somewhat more formal stance: I correct the algorithm and provide an implementation of it.

The issues in the previous algorithm

The problem previously was that we didn't recognize every case: I tried to build the tables in a slightly wrong way.

Some false-positive conflicts

In the following grammar, the previous algorithm incorrectly concluded that there is a conflict, because the follow set of the grammar rule ends up unconstrained.

grammar → ∘ WFB declaration
grammar → ∘ grammar WFB VALIGN declaration
declaration → ∘ identifier ":" rhs_block

Predicting grammar → ∘ grammar WFB VALIGN declaration ends up incorrectly expecting that the grammar rule is followed by 'declaration' without a WFB guard.

This means that when we shift over 'grammar' we get the following state:

grammar → grammar ∘ WFB VALIGN declaration

The algorithm then expects a 'declaration' to come right after a 'grammar' and concludes there is a shift/reduce conflict, whereas there actually is none: the declarations would be separated by the well-founded block and the vertical alignment constraint.
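
To make this concrete, consider a hypothetical input in this notation:

first: a b c
second: d e f

When the parser has read the first declaration and peeks at 'second', that identifier sits at the layout column. It therefore falls under the WFB guard instead of acting as an ordinary lookahead symbol, and the reduction goes through without a conflict.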

Modifications to the LR parser were not described well

Another cue that something's wrong here is that I didn't describe the parsing process itself well enough. How we modify the LR parser determines what kind of decision tables we should build.

The layout-sensitive LR parsing algorithm

To model the parsing algorithm, I sketch it in Idris:

data ParserState : Type -> Type where
    LayoutLR : (top : Nat)
            -> (layout : Nat)
            -> (left_bound : Nat)
            -> (stack : List (Nat, Nat, Nat, a))
            -> ParserState a

depth : ParserState a -> Nat
depth (LayoutLR _ _ _ stack) = length stack

This is not yet a valid model of the parser. We would have to state that a parsing state is valid with respect to some valid decision table.

Initial state:

init : ParserState a
init = LayoutLR 0 0 1 []

Push state:

push : ParserState a -> Nat -> Nat -> Nat -> a -> ParserState a
push (LayoutLR top layout left_bound stack)
    top' layout' left_bound' item
    = LayoutLR top' layout' left_bound'
      ((top, layout, left_bound', item) :: stack)

Pop state:

pop : (st:ParserState a) -> {s:depth st = S k} -> (a, ParserState a)
pop (LayoutLR _ _ _ ((top,layout,left,item) :: stack)) {s=Refl}
    = (item, LayoutLR top layout left stack)

multipop : (st:ParserState a) -> (n : Nat) -> LTE n (depth st)
        -> (List a, ParserState a)
multipop st Z _ = ([], st)
multipop st (S m) pf = case pop st {s=?pop_pf} of
    (x, st') => case multipop st' m ?sub_pf of
        (xs, st'') => (xs ++ [x], st'')

Now we model the transitions:

data Transition : Type -> Type where
    Accept : Transition a
    Shift : (to:Nat)
         -> (left_bound:Nat)
         -> (item:a)
         -> (wfb:Bool)
         -> (valign:Bool) -> Transition a
    Reduce : (rule:Nat)
         -> (depth:Nat)
         -> (lhs:Symbol)
         -> Transition a
    Error : ErrorReport -> RecoveryPlan -> Transition a

Finally we can model the valid state transitions for the automaton:

data ValidTransition : Transition a
                    -> (before:ParserState a)
                    -> (after:ParserState a)
                    -> Type where

In the layouted shift we have to consider the valign and wfb flags in the transition. If valign is set, the shifted item must left-align with the current left bound. Likewise, if the symbol starts a well-founded block, it must constrain the layout.

    ValidShift
        : Either
            (valign = False)
            (valign = True, left = item_col)
       -> Either
            (wfb = False, new_layout = layout)
            (wfb = True,  new_layout = item_col)
       -> before = LayoutLR top layout left stack
       -> ValidTransition
            (Shift to item_col item wfb valign)
            before (push before to new_layout item_col item)

Likewise, the reduction must pop n items from the state, then shift the reduced form back in.

    ValidReduce
        : (items, after) = multipop before depth it_is_valid_state
       -> after' = after `lhs_shift` lhs left (reduction rule items)
       -> ValidTransition (Reduce rule depth lhs) before after'

The shift after a reduction into a different state is included in the definition of reduction. However, there's an important detail here: LHS shifts have to consider the WFB alignment, yet they can ignore the vertical alignment, because the state won't be reached if the vertical alignment rule is not satisfied.

Finally we have conditions that terminate the parse. They need to be modeled as well, though I haven't ended up modeling them in full.

data ValidTermination : Transition a
                     -> (before:ParserState a)
                     -> Type where
    ValidAccept
        : ValidTermination (Accept) (LayoutLR _ _ _ ((_,_,_,_)::_))
    ValidError
        : ValidTermination (Error report recovery_plan) before

So there we have the LR parsing state and its transitions modeled.

Overall it shows that we have one extra lookahead symbol that stands for all symbols that fall to the left of the layout column. As reductions are made, the layout column retracts.

One thing not modeled here is how to pick the action from the decision table, but it's quite easy: if the 'layout' number is smaller than the symbol's 'left' column, use the action marked for the symbol; otherwise use the WFB action. The end-of-file marker should always stay the end-of-file marker though.

def get_action(sym, left):
    if p.layout < left or sym is None:
        return state[p.top][sym]
    else:
        return state[p.top]['wfb']

To prove correctness of the whole algorithm, we would first have to prove that the LR parser decides whether a sentence is valid with respect to some grammar.

Assuming that LR parsing works for this purpose, we should be able to recognize some limited form of layout using the rules we have.

For now I won't go into proving LR parsers correct with Idris. Not yet anyway. Still, the above helped me see where the layout parsing extension went wrong.

Itemset buildup

In the previous implementation I modified the prediction step to find WFB/VALIGN items. I think this was not a good idea because it can be done without modifying the itemset construction.

To recap, I explain the itemset buildup here. You get an item when you add a dot to a production rule. The dot represents a position during parsing.

To start, you create a production rule such as ⊤ → program and give it a dot.

⊤ → ∘ program

The set above is a seed itemset. The next step is called prediction because it predicts which rules start in this itemset. You add the following rules into the seed itemset and get the whole itemset:

program → ∘ expression
program → ∘ program expression
expression → ∘ expression identifier
expression → ∘ identifier

Inside the whole itemset, you shift the dot right in every rule and group them by which symbol you shifted the dot over:

program: ⊤ → program ∘
         program → program ∘ expression

expression: program → expression ∘
            expression → expression ∘ identifier

identifier: expression → identifier ∘

These are the new itemsets, and then you repeat these steps until you don't find any new seed itemsets:

0: ⊤ → ∘ program
   ------------------------------------
   program → ∘ expression
   program → ∘ program expression
   expression → ∘ expression identifier
   expression → ∘ identifier

1: ⊤ → program ∘
   program → program ∘ expression
   ------------------------------------
   expression → ∘ expression identifier
   expression → ∘ identifier

2: program → expression ∘
   expression → expression ∘ identifier
   ------------------------------------

3: expression → identifier ∘
   ------------------------------------

4: program → program expression ∘
   expression → expression ∘ identifier
   ------------------------------------

5: expression → expression identifier ∘
   ------------------------------------

Be sure to keep track of which items in each itemset are seeds and which are predicted items.

If you see that there would be no conflicts if you built the decision tables from the above itemsets, then you have a grammar that can be recognized by an LR(0) parser.

Recognizing shifting modes

To decide whether to shift with WFB or VALIGN, we have to determine whether or not to mark lexemes in each itemset.

There were details that I didn't catch earlier though.

  1. VALIGN only affects the lexemes because it constrains shifting.
  2. WFB markings affect items in itemsets.
  3. Both VALIGN and WFB may have weird behavior with empty rules. For now, let's require that a WFB/VALIGN-marked symbol must not recognize an empty string.

To compute WFB, let's look at the itemsets. The seed items are marked '!wfb'.

If there's an item that predicts something with a WFB tag, the symbol becomes WFB-marked. For example, below 'expression' would become WFB-marked:

program → ∘ WFB expression

Likewise, if there's a '!wfb' tag on the LHS and the prediction has no WFB mark, then the '!wfb' transfers. Below, the 'documentation' symbol would become '!wfb' if 'program' has '!wfb' on it.

program → ∘ documentation

So eventually we find out which items are 'wfb' and which are not.

On symbols we do the same thing: we mark the LHS symbols in seed items '!valign' and then flow the marks similarly, except that the valign marks are dropped from non-lexemes afterwards.

If there are conflicts, you find out you'd have to mark a shift with both 'wfb' and '!wfb', or with both 'valign' and '!valign'.

FOLLOW set buildup

Aside from knowing which shift to execute, we need to build accurate FOLLOW sets and then apply them to predict under which conditions each predicted rule will reduce.

Without layouting, we'd get a correct prediction for each item by looking at the LHS symbol and then reading what comes after the LHS in each itemset.

But now the FOLLOW and FIRST sets both have to be split, such that we separate vertically aligning lexemes from ordinary lexemes.

If the item is WFB-marked, we do the following:

  1. Add contents of FOLLOW(itemset, lhs) to lookahead of item.
  2. If VFOLLOW(itemset, lhs) is nonempty, add 'wfb' to lookahead of item.

Without a WFB-mark, we will just merge the FOLLOW and VFOLLOW symbol sets and make that the lookahead.

There's a special case where a seed item in an itemset may have to be 'wfb'-marked, meaning it will be 'wfb'-shifted. Two conditions must be fulfilled:

  1. The item ends with a WFB-marked symbol.
  2. The WFB shift is guaranteed to happen with the same 'layout' number as the one the item started with.

These special seed-items are treated as if they were WFB-marked. They get the appropriate lookaheads and their last shift before reduction will be marked with WFB.

That's the whole thing, and I don't think it needs other modifications. There's another thing I want to pick up though.

Smarter way to compute the FIRST set

This page about LALR(1) parsing was pointed out to me by a person who's working on a replacement language for bash.

It leads to a page that has an easy explanation of the FIRST and FOLLOW sets.

It still may not open up that easily, but the idea is to extend the FIRST and FOLLOW sets over to sequences of symbols.
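
To give a rough idea, here's a minimal sketch of FIRST extended to a sequence of symbols, assuming the 'first' dict (symbol to FIRST set) and the 'empty' set of nullable symbols that the script below builds:

def first_of_sequence(seq):
    result = set()
    for sym in seq:
        result.update(first[sym])
        if sym not in empty:
            return result, False # the sequence cannot derive an empty string
    return result, True          # every symbol in the sequence was nullable

This is the same traversal that the examine() routine further down performs over the tail of an item.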

Implementation in parsergen3.py

I revised the parser generator presented in the previous post. It still uses the same in-memory representation for the grammar:

grammar = [
    (None, ['program']),
    ('program', []),
    ('program', ['program', 'expression']),
    ('expression', ['identifier']),
]
lexemes = ['identifier']
wfb_constraints = set()
valign_constraints = set()

Note that if you insert rules into the grammar, the wfb and valign constraints have to be updated as well.
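
For example, the layout-sensitive test grammar used later in this post could be encoded like this. The constraint points are (rule, index) pairs naming a position on a rule's right-hand side; this encoding is my reading of how the script consumes them:

# program: program VALIGN WFB procedure
# puts both markers on the 'procedure' symbol, i.e. rule 2, index 1.
grammar = [
    (None, ['program']),
    ('program', []),
    ('program', ['program', 'procedure']),
    ('procedure', ['procedure', 'identifier']),
    ('procedure', ['identifier']),
]
lexemes = ['identifier', 'number', 'string']
wfb_constraints = set([(2, 1)])
valign_constraints = set([(2, 1)])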

Now here's the itemset buildup; there are no tricks or special handling in this step.

# The following routine builds the LR0 itemsets
def after_dot((rule, index)):
    lhs, rhs = grammar[rule]
    if index < len(rhs):
        return rhs[index]

def predict(items):
    visited = set(items)
    prediction = set()
    while len(items) > 0:
        sym = after_dot(items.pop())
        for index, (lhs,rhs) in enumerate(grammar):
            if sym == lhs and sym is not None:
                item = (index, 0)
                if item not in visited:
                    items.append(item)
                    prediction.add(item)
                    visited.add(item)
    return prediction

def partition(items):
    groups = {}
    for item in items:
        sym = after_dot(item)
        # The items to be shifted are already shifted here,
        # so they won't need the treatment later.
        if sym is not None:
            item = (item[0], item[1]+1)
        try:
            groups[sym].append(item)
        except KeyError as _:
            groups[sym] = [item]
    return [(sym, frozenset(items)) for sym,items in groups.items()]

seed_itemsets = [ frozenset([(0,0)]) ]
prediction_itemsets = []
itemsets_index = dict((s,i) for i,s in enumerate(seed_itemsets))
shifts = []
reductions = []
vectors = []
for k, seed_itemset in enumerate(seed_itemsets):
    prediction_itemset = predict(list(seed_itemset))
    itemset = seed_itemset | prediction_itemset
    prediction_itemsets.append(prediction_itemset)
    k_shifts = []
    k_reductions = set()
    for sym, items in partition(itemset):
        if sym is None:
            k_reductions.update(items)
        else:
            # If the seed itemset is not already in the table,
            # it'll be added to the table.
            try:
                j = itemsets_index[items]
            except KeyError as _:
                j = len(seed_itemsets)
                itemsets_index[items] = j
                seed_itemsets.append(items)
            k_shifts.append((sym, j))
    # The reductions and shifts are recorded
    # this way they don't need to be recorded later.
    shifts.append(k_shifts)
    reductions.append(k_reductions)
    # Vectors are used for memoizing itemsets later.
    # It is sorted so that lookahead vectors appear in same order
    # as what the itemsets are printed in.
    vectors.append(tuple(sorted(seed_itemset)))
    # Separating and sorting the itemsets provides easy, cleaner printout.
    if print_itemsets_to_stdout:
        print_itemset(k, list(vectors[k]) + list(sorted(prediction_itemset)))

Once the itemsets have been built up, we can start doing the analysis.

# Compute the set of empty symbols in the grammar.
def empty_symbols():
    symbols = set()
    for lhs,rhs in grammar:
        if len(rhs) == 0:
            symbols.add(lhs)
    m = 0
    n = len(symbols)
    while m < n:
        for lhs, rhs in grammar:
            if all(x in symbols for x in rhs):
                symbols.add(lhs)
        m = n
        n = len(symbols)
    return symbols
empty = empty_symbols()

# Check the constraints are well-formed.
empty_valign_wfb = set()
for point in wfb_constraints | valign_constraints:
    sym = after_dot(point)
    if sym in empty:
        empty_valign_wfb.add(sym)

if len(empty_valign_wfb) > 0:
    exit_status = 1
    sys.stderr.write(
        "warning: Constraints in empty symbols\n"
        "    Constraints will end up dangling if their rules are empty\n"
        "    constrained empty symbols:\n"
        "    {}\n".format(",".join(map(repr,sorted(empty_valign_wfb)))))

The computation of FIRST/VFIRST is not different from before, except that the vertical alignment constraints are considered.

# FIRST & VFIRST -sets construction
def first_lexemes():
    symbols = dict()
    routes = set()
    for sym in lexemes:
        symbols[sym] = set([sym])
    for lhs, rhs in grammar:
        if lhs not in symbols:
            symbols[lhs] = set([])
    for rule, (lhs, rhs) in enumerate(grammar):
        for index, rhsN in enumerate(rhs):
            if (rule,index) not in valign_constraints:
                routes.add((lhs,rhsN))
            if rhsN not in empty:
                break
    rep = True
    while rep:
        rep = False
        for lhs, rhs0 in routes:
            n = len(symbols[lhs])
            symbols[lhs].update(symbols[rhs0])
            rep |= n < len(symbols[lhs])
    return symbols
first = first_lexemes()

def vfirst_lexemes():
    symbols = dict()
    routes = set()
    for sym in lexemes:
        symbols[sym] = set([])
    for lhs, rhs in grammar:
        if lhs not in symbols:
            symbols[lhs] = set([])
    for rule, (lhs, rhs) in enumerate(grammar):
        for index, rhsN in enumerate(rhs):
            if (rule,index) in valign_constraints:
                symbols[lhs].update(first[rhsN])
            else:
                # Unconstrained leading symbols pass their VFIRST upward;
                # these routes feed the fixed-point loop below.
                routes.add((lhs,rhsN))
            if rhsN not in empty:
                break
    rep = True
    while rep:
        rep = False
        for lhs, rhs0 in routes:
            n = len(symbols[lhs])
            symbols[lhs].update(symbols[rhs0])
            rep |= n < len(symbols[lhs])
    return symbols
vfirst = vfirst_lexemes()

I think the first time I implemented the FOLLOW sets there were issues: the earlier program updated the sets in the wrong direction.

# FOLLOW -sets
def examine((rule,index)):
    s = set()
    w = set()
    rhs = grammar[rule][1]
    for i in range(index+1, len(rhs)):
        w.update(vfirst[rhs[i]])
        if (rule,i) in valign_constraints:
            w.update(first[rhs[i]])
        else:
            s.update(first[rhs[i]])
        if rhs[i] not in empty:
            return s,w,False
    return s,w,True

def follow_lexemes(seedset, predictionset):
    seeds   = {}
    symbols = {}
    vsymbols = {}
    routes = set()
    for item in seedset | predictionset:
        sym0 = after_dot(item)
        if sym0 not in symbols:
            symbols[sym0] = set()
            vsymbols[sym0] = set()
            seeds[sym0] = set()
    for item in seedset:
        sym0 = after_dot(item)
        syms, vsyms, reductive = examine(item)
        symbols[sym0].update(syms)
        vsymbols[sym0].update(vsyms)
        if reductive:
            seeds[sym0].add(item)
    for item in predictionset:
        sym0 = after_dot(item)
        syms, vsyms, reductive = examine(item)
        symbols[sym0].update(syms)
        vsymbols[sym0].update(vsyms)
        if reductive:
            lhs = grammar[item[0]][0]
            routes.add((lhs, sym0))
    rep = True
    while rep:
        rep = False
        for lhs, sym in routes:
            n = len(symbols[sym])
            symbols[sym].update(symbols[lhs])
            rep |= n < len(symbols[sym])
            n = len(vsymbols[sym])
            vsymbols[sym].update(vsymbols[lhs])
            rep |= n < len(vsymbols[sym])
            n = len(seeds[sym])
            seeds[sym].update(seeds[lhs])
            rep |= n < len(seeds[sym])
    return seeds, symbols, vsymbols

It's a good time to compute the WFB/VALIGN sets as well. These follow the specification from earlier.

def wfb_sets(seedset, predictionset):
    sym_routes = dict()
    for item in predictionset:
        lhs = grammar[item[0]][0]
        if lhs is None:
            continue
        try:
            sym_routes[lhs].add(item)
        except KeyError as _:
            sym_routes[lhs] = set([item])
    routes = []
    nwfb_items = set()
    wfb_items = set()
    for item in predictionset:
        sym0 = after_dot(item)
        for sym_item in sym_routes.get(sym0, ()):
            if item in wfb_constraints:
                wfb_items.add(sym_item)
            else:
                routes.append((item, sym_item))
    for item in seedset:
        sym0 = after_dot(item)
        if item in wfb_constraints:
            wfb_items.update(sym_routes.get(sym0, ()))
        else:
            nwfb_items.update(sym_routes.get(sym0, ()))
    n = 0
    m = len(nwfb_items) + len(wfb_items)
    while n < m:
        n = m
        for item, sym_item in routes:
            if item in wfb_items:
                wfb_items.add(sym_item)
            if item in nwfb_items:
                nwfb_items.add(sym_item)
        m = len(nwfb_items) + len(wfb_items)
    return wfb_items, nwfb_items

def valign_sets(seedset, predictionset):
    routes = []
    valign_syms = set()
    nvalign_syms = set()
    for item in predictionset:
        lhs = grammar[item[0]][0]
        sym0 = after_dot(item)
        if sym0 is not None:
            if item in valign_constraints:
                valign_syms.add(sym0)
            else:
                routes.append((lhs, sym0))
    for item in seedset:
        sym0 = after_dot(item)
        if sym0 is not None:
            if item in valign_constraints:
                valign_syms.add(sym0)
            else:
                nvalign_syms.add(sym0)
    n = 0
    m = len(valign_syms) + len(nvalign_syms)
    while n < m:
        n = m
        for item, sym_item in routes:
            if item in valign_syms:
                valign_syms.add(sym_item)
            if item in nvalign_syms:
                nvalign_syms.add(sym_item)
        m = len(nvalign_syms) + len(valign_syms)
    valign_syms &= set(lexemes)
    nvalign_syms &= set(lexemes)
    return valign_syms, nvalign_syms

Then everything is put together.

follow_seeds = []
follow_syms = []
follow_vsyms = []
wfbs = []
nwfbs = []
valigns = []
nvaligns = []

for i in range(len(seed_itemsets)):
    seeds,syms,vsyms = follow_lexemes(seed_itemsets[i], prediction_itemsets[i])
    wfb,nwfb = wfb_sets(seed_itemsets[i], prediction_itemsets[i])
    valign,nvalign = valign_sets(seed_itemsets[i], prediction_itemsets[i])
    follow_seeds.append(seeds)
    follow_syms.append(syms)
    follow_vsyms.append(vsyms)
    wfbs.append(wfb)
    nwfbs.append(nwfb)
    valigns.append(valign)
    nvaligns.append(nvalign)

To know how to handle the seed itemsets, we recognize the layout condition described earlier:

def wfb_rules():
    bitmap = []
    for rule, (lhs,rhs) in enumerate(grammar):
        if (rule,len(rhs)-1) not in wfb_constraints:
            is_wfb = False
        else:
            is_wfb = all((((rule,index) in valign_constraints)
                          or (len(first[rhs[index]]) == 0))
                         for index in range(1, len(rhs)))
        bitmap.append(is_wfb)
    return bitmap
wfbr = wfb_rules()

Now there's only one task remaining: building the LR(1) itemsets.

def followup(k, seed_lookahead, item):
    if item in seed_lookahead:
        return seed_lookahead[item]
    else:
        # The default resolution
        sym = grammar[item[0]][0]
        lookahead = set(follow_syms[k][sym])
        vlookahead = set(follow_vsyms[k][sym])
        for seeditem in follow_seeds[k][sym]:
            lookahead.update(seed_lookahead[seeditem])
        if (item in wfbs[k]) or wfbr[item[0]]:
            if len(vlookahead) > 0:
                lookahead.add('wfb')
            return lookahead
        else:
            return lookahead | vlookahead

def previous(items):
    for rule,index in items:
        yield (rule, index-1)

lr_mapping = [(0, frozenset([None]))]
lr_index = dict((s,i) for i,s in enumerate(lr_mapping))
lr_tabs = []
lr_conflicts = {}

for mapping in lr_mapping:
    k = mapping[0]
    tab = {}
    lr_tabs.append(tab)
    assert 1 + len(vectors[k]) == len(mapping)
    seed_lookahead = dict(zip(vectors[k], mapping[1:]))
    for sym, j in shifts[k]:
        to_mapping = (j,) + tuple(
            frozenset(followup(k, seed_lookahead, s_item))
            for s_item in previous(vectors[j]))
        flags = 0
        if sym in valigns[k]:
            if sym in nvaligns[k]:
                sys.stderr.write("error: VALIGN conflict at {} {}\n".format(k, sym))
                exit_status = 1
            else:
                flags |= 4
        wfb  = False
        nwfb = False
        for (rule, index) in previous(vectors[j]):
            if wfbr[rule] and index + 1 == len(grammar[rule][1]):
                wfb = True
            if (rule,index) in wfbs[k]:
                wfb = True
            if (rule,index) in nwfbs[k]:
                nwfb = True
        if wfb:
            if nwfb:
                sys.stderr.write("error: WFB conflict at {} {}\n".format(k, sym))
                exit_status = 1
            else:
                flags |= 2
        if to_mapping in lr_index:
            tab[sym] = (flags, lr_index[to_mapping])
        else:
            tab[sym] = (flags, len(lr_mapping))
            lr_index[to_mapping] = len(lr_mapping)
            lr_mapping.append(to_mapping)
    had_conflicts = []
    for item in reductions[k]:
        for sym in followup(k, seed_lookahead, item):
            action = (1, item[0])
            if sym not in tab:
                tab[sym] = action
            else:
                if (mapping,sym) in lr_conflicts:
                    lr_conflicts[(mapping,sym)].append(item)
                elif tab[sym][0] == 1:
                    lr_conflicts[(mapping,sym)] = [
                        (tab[sym][1], len(grammar[tab[sym][1]][1])),
                        item ]
                    had_conflicts.append(sym)
                else:
                    lr_conflicts[(mapping,sym)] = (
                        list(previous(vectors[lr_mapping[tab[sym][1]][0]]))
                        + [item])
                    had_conflicts.append(sym)
    if len(had_conflicts) > 0:
        sys.stderr.write('LR conflicts: {}\n'.format(mapping))
        for sym in had_conflicts:
            print_itemset('  {}'.format(sym), lr_conflicts[(mapping,sym)],
                sys.stderr)
            if tab[sym][0] != 1:
                sys.stderr.write("  {}: shift to {}\n".format(sym,
                    lr_mapping[tab[sym][1]][0]))

As a bonus, here's how to build LALR(1) tables. You kind of collapse the tables together.

lalr_tabs = [{} for _ in seed_itemsets]
for i, vec in enumerate(lr_mapping):
    row = lalr_tabs[vec[0]]
    for key, action in lr_tabs[i].items():
        if action[0] == 1:
            rw_action = action
        else:
            to = lr_mapping[action[1]][0]
            rw_action = (action[0], to)
        if key not in row:
            row[key] = rw_action
        else:
            assert row[key] == rw_action, (
                "LALR table collision"
                "\nstate: {}"
                "\nrow[key]: {}"
                "\nrw_action: {}"
                ).format(vec[0], row[key], rw_action)

Notice that the parsing tables coming out of this one are different: I decided to just use a pair of numbers for reductions and shifts. You derive a table from the grammar that provides the reduction LHS symbols and pop counts.
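
For instance, the reduction side can be derived from the grammar in one line; this sketch produces the same kind of 'rstate' table that appears in the generated output below:

# Each rule contributes its LHS and the number of RHS symbols to pop.
rstate = [(lhs, len(rhs)) for lhs, rhs in grammar]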

I've written a parser generator with these features built-in. The code above is directly from this script. You can download it here: parsergen3.py

But does it work?

I can't be certain whether the layout sensitivity works now, but at least the reasoning behind it is a bit more solid than before.

People came to me with ordinary grammars, but I didn't get material to test this on a layout-sensitive grammar. I guess we'll come up with something.

I built a grammar that used to be problematic for the parser generator:

lexemes identifier, number, string

program:
program: program VALIGN WFB procedure

procedure: procedure identifier
procedure: identifier

Lets see what the script thinks of this:

python parsergen3.py grammar -p -m -o language.py

First the LR0 itemsets come out; these are printed if the -p flag is supplied.

0: ⊤ → ∘ program
   program → ∘
   program → ∘ program procedure
1: ⊤ → program ∘
   program → program ∘ procedure
   procedure → ∘ procedure identifier
   procedure → ∘ identifier
2: procedure → identifier ∘
3: program → program procedure ∘
   procedure → procedure ∘ identifier
4: procedure → procedure identifier ∘

We can verify that they look right. Since we gave the -m flag, it prints out a whole bunch of information. Do the FIRST/VFIRST symbol sets look right?

FIRST
  string: 'string'
  number: 'number'
  program: 
  None: 
  identifier: 'identifier'
  procedure: 'identifier'
VFIRST
  string: 
  number: 
  program: 'identifier'
  None: 
  identifier: 
  procedure:

Why would 'program' start with an 'identifier' that must be vertically aligned? If we examine the grammar, it actually does do that. This is a bit problematic, but we may interpret it such that the first item must be aligned all the way to the left margin. That's not something I want though, so I'll change it later.

The FOLLOW sets could be printed out a bit better. The seed set here tells which seed items "come through" and get their lookahead symbols into prediction items.

FOLLOW (SEED)
  0: {'program': set([(0, 0)]), None: set([(0, 0)])}
  1: {'identifier': set([(2, 1)]), 'procedure': set([(2, 1)]), None: set([(0, 1)])}
  2: {None: set([(4, 1)])}
  3: {'identifier': set([(3, 1)]), None: set([(2, 2)])}
  4: {None: set([(3, 2)])}

I got 'None' productions in the follow set; that's not right. They don't have any effect, but they clutter the output, so I fixed it in the script. Next we get the FOLLOW and VFOLLOW sets. These seem about right, as the identifier is correctly identified as vertically aligning.

FOLLOW
  0: {'program': set([]), None: set([])}
  1: {'identifier': set(['identifier']), 'procedure': set(['identifier']), None: set([])}
  2: {None: set([])}
  3: {'identifier': set([]), None: set([])}
  4: {None: set([])}
FOLLOW (VALIGN)
  0: {'program': set(['identifier']), None: set(['identifier'])}
  1: {'identifier': set([]), 'procedure': set([]), None: set([])}
  2: {None: set([])}
  3: {'identifier': set([]), None: set([])}
  4: {None: set([])}

The WFB items are guaranteed to be shifted with a layout constraint. There are few of them and the way they orient is a bit weird, but otherwise this seems correct.

WFB ITEMS
1: procedure → ∘ procedure identifier
   procedure → ∘ identifier
NWFB ITEMS
0: program → ∘
   program → ∘ program procedure

The VALIGN/NVALIGN constraints seem correct as well. Finally, the one WFBR rule is marked correctly.

VALIGN LEXEMES
  0: set([])
  1: set(['identifier'])
  2: set([])
  3: set([])
  4: set([])
NVALIGN LEXEMES
  0: set([])
  1: set([])
  2: set([])
  3: set(['identifier'])
  4: set([])
WFBR RULES
  program → program procedure

The output came out like this; I readjusted the line breaks a bit:

state = [
  { None: (1, 1), 'identifier': (1, 1), 'program': (0, 1)},
  { None: (1, 0), 'identifier': (6, 2), 'procedure': (2, 3)},
  { None: (1, 4), 'identifier': (1, 4), 'wfb': (1, 4)},
  { None: (1, 2), 'identifier': (0, 4), 'wfb': (1, 2)},
  { None: (1, 3), 'identifier': (1, 3), 'wfb': (1, 3)}]
rstate = [
    (None, 1),
    ('program', 0),
    ('program', 2),
    ('procedure', 2),
    ('procedure', 1) ]
keywords = []

The state table here seems correct: '1' is for reduction and '0' is for shift, and shifts additionally carry a few flags.

  1. In the first state it correctly recognizes that it must reduce the empty 'program' and then shift to the second state.
  2. In the second state we get a shift with flags 4+2 on 'identifier', so it requests a VALIGN+WFB shift there, and with 'procedure' it's just a WFB shift. That's correct as well.
  3. The third state always reduces, because the identifier forms a procedure.
  4. The fourth state comes from the second state after a procedure has been found. A symbol falling under the layout or the end of the stream causes it to reduce, but an identifier causes it to shift to the fifth state.
  5. The fifth state deterministically does the reduction.

But does it parse anything? Well, the driver is documented in the Idris code above, right where it describes the valid state transitions.

class Parser:
    def __init__(self):
        self.stack = []
        self.top = 0
        self.layout = 0
        self.left = 1
p = Parser()

def get_action(sym, left):
    if p.layout < left or sym is None:
        return state[p.top][sym]
    else:
        return state[p.top]['wfb']

def advance(sym, text, start, stop):
    arg = get_action(sym, start[0])
    while arg[0] == 1:
        lhs, count = rstate[arg[1]]
        objects = []
        for _ in range(count):
            p.top, p.layout, p.left, obj = p.stack.pop()
            objects.append(obj)
        objects.reverse()
        reduced_object = reduction(arg[1], objects)
        if lhs is None:
            p.stack = []
            p.top = 0
            p.layout = 0
            p.left = 1
            # the 'accept' behavior
            pp = pprint.PrettyPrinter(indent=2)
            pp.pprint(objects[0])
            return
        else:
            arg = state[p.top][lhs]
            assert arg[0] != 1
            p.stack.append((p.top, p.layout, p.left, reduced_object))
            p.top = arg[1]
            if arg[0] & 2 != 0: # WFB flag triggers layout.
                p.layout = p.left
            arg = get_action(sym, start[0])
    if arg[0] & 4 != 0: # VALIGN triggers a 'bonus' error.
        assert p.left == start[0], "layout check (line {})".format(start[1])
    p.stack.append((p.top, p.layout, start[0], text))
    p.top = arg[1]
    p.left = start[0]
    if arg[0] & 2 != 0: # WFB flag triggers layout.
        p.layout = start[0]

def reduction(rule, args):
    if rule == 1:
        return [':program']
    if rule == 2:
        return args[0] + [args[1]]
    if rule == 3:
        return args[0] + [args[1]]
    if rule == 4:
        return [':call', args[0]]
    return [rule] + args

In conjunction with a properly written tokenizer, it parses the input:

tok = Tokenizer()
for ch in fd.read():
    tokenize(tok, ch)
tokenize(tok, "")
advance(None, "", (0,0), (0,0))

Here's an input file:

Hello foo bar
  guux boob
blah blah blah

And here's a pretty printed output:

[ ':program',
  [':call', 'Hello', 'foo', 'bar', 'guux', 'boob'],
  [':call', 'blah', 'blah', 'blah']]

Also, like I predicted, the VALIGN constraint gets triggered if the first symbol isn't aligned to the left margin.

That seems quite good, but now I'll resume helping people with their parsers.

Bonus: A bit about LR(0) parsers

If you look at the LR0 itemsets and find out that every reduction item is in its own itemset, then you have a grammar that can be parsed with an LR0 parser.

If you can build an LR0 parser, it means you can drop the itemsets that contain just reductions and replace every shift into them with a shift+reduce. This produces a parser that has fewer states than itemsets.

Here's an example of such a grammar:

lexemes identifier, string
sexpr: "(" sexprs ")"
     | "(" ")"
     | identifier
     | string
sexprs: sexpr
      | sexprs sexpr

You've got to be careful that the itemsets actually turn out right:

0: ⊤ → ∘ sexpr
   sexpr → ∘ "(" sexprs ")"
   sexpr → ∘ "(" ")"
   sexpr → ∘ identifier
   sexpr → ∘ string
1: ⊤ → sexpr ∘                  <------
2: sexpr → identifier ∘         <------
3: sexpr → string ∘             <------
4: sexpr → "(" ∘ sexprs ")"
   sexpr → "(" ∘ ")"
   sexpr → ∘ "(" sexprs ")"
   sexpr → ∘ "(" ")"
   sexpr → ∘ identifier
   sexpr → ∘ string
   sexprs → ∘ sexpr
   sexprs → ∘ sexprs sexpr
5: sexprs → sexpr ∘             <------
6: sexpr → "(" ")" ∘            <------
7: sexpr → "(" sexprs ∘ ")"
   sexprs → sexprs ∘ sexpr
   sexpr → ∘ "(" sexprs ")"
   sexpr → ∘ "(" ")"
   sexpr → ∘ identifier
   sexpr → ∘ string
8: sexprs → sexprs sexpr ∘      <------
9: sexpr → "(" sexprs ")" ∘     <------

This trick can also be done with some LR parsers: if you find a state where the same rule reduces in every column, you can replace that state with a shift+reduce jump.

I put arrows on every reduction, and you can see they're each in their own itemset. If you were to build it, this parser would only have 3 states!

0: sexpr:      shift+accept 'sexpr'
   "(":        shift 1
   identifier: shift+accept 'sexpr'
   string:     shift+accept 'sexpr'
1: ")":        shift+reduce 'sexpr'
   sexprs:     shift 2
   sexpr:      shift+reduce 'sexprs'
   "(":        shift 1
   identifier: shift+reduce 'sexpr'
   string:     shift+reduce 'sexpr'
2: ")":        shift+reduce 'sexpr'
   sexpr:      shift+reduce 'sexprs'
   "(":        shift 1
   identifier: shift+reduce 'sexpr'
   string:     shift+reduce 'sexpr'
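
Here's a sketch of that compaction against the LALR tables built earlier. The ('shift+reduce', flags, rule) encoding is made up for illustration, so a real driver would have to understand it; once every shift into a reduce-only row has been rewritten, the row itself becomes unreachable and can be dropped:

def compact(tabs):
    for row in tabs:
        for sym, action in row.items():
            if action[0] == 1:
                continue # plain reductions stay as they are
            flags, to = action
            target = list(tabs[to].values())
            # A reduce-only state: every column holds the same reduction.
            if len(set(target)) == 1 and target[0][0] == 1:
                row[sym] = ('shift+reduce', flags, target[0][1])
    return tabs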

You may wonder what you would possibly do with one like this...

Bonus 2: What's an SLR parser?

In an SLR parser you take the FOLLOW set, but instead of calculating it from the itemsets, you calculate it from the grammar. This allows you to resolve whether to shift or by which rule to reduce. An SLR parser has as many states as you have itemsets.
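
Here's a sketch of that grammar-level FOLLOW computation, assuming the 'grammar', 'first' and 'empty' structures from above and ignoring the layout split for simplicity:

def grammar_follow():
    follow = {}
    for lhs, rhs in grammar:
        follow.setdefault(lhs, set())
        for sym in rhs:
            follow.setdefault(sym, set())
    rep = True
    while rep:
        rep = False
        for lhs, rhs in grammar:
            for i, sym in enumerate(rhs):
                n = len(follow[sym])
                for nxt in rhs[i+1:]:
                    follow[sym].update(first[nxt])
                    if nxt not in empty:
                        break
                else:
                    # Everything after sym can vanish,
                    # so FOLLOW(lhs) leaks into FOLLOW(sym).
                    follow[sym].update(follow[lhs])
                rep |= n < len(follow[sym])
    return follow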

The October offer still holds

If you contact me this October, I will help you build an LR parser for your language. Afterwards I will feature your language on this blog if you'd like.

It doesn't need to be a programming language; see the previous week's post for some ideas. The parser can be written in any language you know.

I adjusted the submission grammar a bit: mainly, instead of using layouting to separate the rules, I expect you to use vertical columns, as it's cleaner and easier to parse. :)

lexemes identifier, string
grammar_0:
grammar_0: grammar
grammar: WFB declaration
       | grammar VALIGN WFB declaration
declaration: identifier ":" rhs_block
           | identifier ":"
           | "lexemes" identifiers
rhs_block: rhs
         | rhs_block "|" rhs
rhs: rhs_1
rhs_1: symbol
     | rhs_1 symbol
symbol: "VALIGN" symbol
      | "WFB" symbol
      | identifier
      | string
identifiers: identifier
           | identifiers "," identifier

Though everybody presenting their language has had their own format so far. If you write a grammar for a driver and change the parser a bit, you can modify the parsergen to accept custom grammar files. I may add more features to the script to make this easier.

There are plenty of things people can do with LR and LALR parsers. Since they are only state machines with stacks, you can potentially even use them in memory-constrained environments. For example, would you like to have your parser work on a C64? No problem: make it work with LALR(1), compact the tables as small as they go, pre-tokenize the input so the programs don't take all the space, and make the language as simple as it can be. You get the whole thing packed into 2 kilobytes or less.
