Right recursion at Marpa parser

Here are the all parsing related blog posts I have written so far:

The shape and order in this list hints towards a progression and a story. It tells about me:

Marpa parser is an Earley parser with all the improvements to it aggregated together. There's one thing about this chart based 'marpa' parsing approach that I didn't quite understood until now.

Earley parsing cannot handle right recursive grammars very well. To point it out I'm using a simple grammar to parse a short expression with Earley parser algorithm.

Here's the grammar:

exp -> expr '^' exp
exp -> expr

Here's the expression:

expr '^' expr '^' expr '^' expr

Earley parser creates 'sets'. This is the very first set it creates:

°expr  '^'   expr   '^'   expr   '^'   expr
0:
   0: exp -> ° expr   '^'   exp
   0: exp -> ° expr

It goes all good to the first add-expression:

expr   '^'   expr ° '^'   expr   '^'   expr
3:
   2: exp ->   expr ° '^'   exp
   2: exp ->   expr °
   0: exp ->   expr   '^'   exp °

Then the chart starts to grow:

expr   '^'   expr   '^'   expr ° '^'   expr
5:
   4: exp ->   expr ° '^'   exp
   4: exp ->   expr °
   2: exp ->   expr   '^'   exp °
   0: exp ->   expr   '^'   exp °

And the chart grows every time the right recursive rule is reduced:

expr   '^'   expr   '^'   expr   '^'   expr °
7:
   6: exp ->   expr ° '^'   exp
   6: exp ->   expr °
   4: exp ->   expr   '^'   exp °
   2: exp ->   expr   '^'   exp °
   0: exp ->   expr   '^'   exp °

This means that Earley parser is slowing down at long sequences of right recursive parses like this. Currently this is true for my Lever parser as well.

Marpa doesn't have problems in parsing with right recursive grammars. It achieves this with Leo silos and items, named after Joop Leo who came up with a method to optimize right recursion on Earley parsing in 1991.

To implement this, you define leo-egilible earley items. When processing this kind of items we can do a trick and cache only the last result of repeated reductions. Here's a leo egilible item from the parsing example we just had:

4: exp -> expr '^' ° exp

This is leo egilible because...

The last rule means that this earley is in an earley set where the symbol it would shift with (° exp) is unique.

To see this is true, here's the earley set where that rule was:

expr   '^'   expr   '^'   expr   '^' ° expr
6:
   4: exp ->   expr   '^' ° exp
   6: exp -> ° expr   '^'   exp
   6: exp -> ° expr

If we shift from this earley set with 'exp', we will get only one result that is the:

4: exp ->   expr   '^'   exp °

This allows us to compress the reductions in these items because they are repeating.

Here's the same parse as earlier, but with leo silos provided:

°expr  '^'   expr   '^'   expr   '^'   expr
0:
   0: exp -> ° expr   '^'   exp
   0: exp -> ° expr

expr ° '^'   expr   '^'   expr   '^'   expr
1:
   0: exp ->   expr ° '^'   exp
   0: exp ->   expr °

expr   '^' ° expr   '^'   expr   '^'   expr
2:
   0: exp ->   expr   '^' ° exp (L)
   2: exp -> ° expr   '^'   exp
   2: exp -> ° expr

expr   '^'   expr ° '^'   expr   '^'   expr
3:
   2: exp ->   expr ° '^'   exp
   2: exp ->   expr °
   0: exp ->   expr   '^'   exp °(L)

expr   '^'   expr   '^' ° expr   '^'   expr
4:
   2: exp ->   expr   '^' ° exp (L:1, starts at 2)
   4: exp -> ° expr   '^'   exp
   4: exp -> ° expr

expr   '^'   expr   '^'   expr ° '^'   expr
5:
   4: exp ->   expr ° '^'   exp
   4: exp ->   expr °
   2: exp ->   expr   '^'   exp °(L*)
   0: exp ->   expr   '^'   exp °(L)

expr   '^'   expr   '^'   expr   '^' ° expr
6:
   4: exp ->   expr   '^' ° exp (L)
   6: exp -> ° expr   '^'   exp
   6: exp -> ° expr

expr   '^'   expr   '^'   expr   '^'   expr °
7:
   6: exp ->   expr ° '^'   exp
   6: exp ->   expr °
   4: exp ->   expr   '^'   exp °(L*)
   2: exp ->   expr   '^'   exp °(L*)
   0: exp ->   expr   '^'   exp °(L)

You can see that if you remove the reduction inferenced items (marked L*), the set 7 has not grown beyond what set 5 was. After this example I weren't yet entirely sure how it is going out there. The overall idea is that we can 'silo' and cut-off in a reduction path because we know where the reduction leads to.

Lets try with a different right-recursive example:

a -> x b
a -> x
b -> y a

The grammar makes up a language that is very much like the earlier one. But this one is not directly right-recursive. Here's the parse chart:

°x y x y x
0:
    0: a -> ° x   b
    0: a -> ° x

x°y x y x
1:
    0: a ->   x ° b (L)
    0: a ->   x °
    1: b -> ° y   a

x y°x y x
2:
    1: b ->   y ° a (L)
    2: a -> ° x   b
    2: a -> ° x

x y x°y x
3:
    2: a ->   x ° b (L)
    2: a ->   x °
    3: b -> ° y   a
    1: b ->   y   a °(L*)
    0: a ->   x   b °(L)

x y x y°x
4:
    3: b ->   y ° a (L)
    4: a -> ° x   b
    4: a -> ° x

x y x y x°
5:
    4: a ->   x ° b
    4: a ->   x °
    3: b ->   y   a °(L*)
    2: a ->   x   b °(L*)
    1: b ->   y   a °(L*)
    0: a ->   x   b °(L)

The reduction path here is even more blatantly apparent. Also it is blatantly deterministic and redundant. If you know that 4: a -> x ° at set 5 led to deterministic reduction, you'll know how the story goes until it hits the 0.

There is two more interesting variations of the above case. Lets consider the following grammar:

a -> x a y
a -> x a
a -> x

And the parse:

°x
0:
    0: a -> ° x   a   y
    0: a -> ° x   a
    0: a -> ° x

x°
1:
    0: a ->   x ° a   y
    0: a ->   x ° a
    0: a ->   x °

The rules in this example are not forming leo-egilible items anymore because the item cannot be postdot unique. This means that there is no longer a reduction path of determistic flavor.

Lets try yet one grammar:

a -> a y
a -> x a
a -> x

Parse:

°x x
0:
    0: a -> ° a   y
    0: a -> ° x   a
    0: a -> ° x

x°x
1:
    0: a ->   x ° a (L)
    0: a ->   x °
    0: a ->   a ° y
    1: a -> ° a   y
    1: a -> ° x   a
    1: a -> ° x

x x°
2:
    1: a ->   x ° a (L)
    1: a ->   x °
    1: a ->   a ° y
    0: a ->   x   a °(L)
    0: a ->   a ° y
    1: a ->   a ° y

Parsing with this grammar gets complicated quickly, but the deterministic reduction rule here appears to keep working just fine.

The postdot-unique -restriction to leo-item-egilibility forces the algorithm to be correct. It singles out every case where the reduction becomes too complex to be memoized.

Loup Vaillant has written some tutorials on earley parsing.

If you're interested about Marpa overall, there's Jeffrey Kegler's Marpa parser website.