Right recursion at Marpa parser
Here are the all parsing related blog posts I have written so far:
- Syntax for a programming language
- LR(1) parsers
- Challenges in parsing c to get bindings
- Document schema in textended
- Context free grammar as a schema
- User interface parsing, about derivatives based parsing
- Pyllisp syntax
- LR & GLL parsers study
- Marpa: parsing revolution
- Pyllisp grammar (and parser) slow but flexible
- Fundamental piece of my compiler
- Introducing Lever programming language
- Chart parser solves the solved problem again
- Finicky C preprocessing and parsing details
The shape and order in this list hints towards a progression and a story. It tells about me:
- Realising that hand-parsers aren't good for changing designs.
- Studying parsing and getting disappointed.
- Trying visual programming approaches and discarding them because I was not getting rid of the parsing problem this way.
- Giving a lisp syntax for my study language as an intermediate solution.
- Further studying parsing.
- Discovering marpa.
- Applying it to solve my parsing problems.
Marpa parser is an Earley parser with all the improvements to it aggregated together. There's one thing about this chart based 'marpa' parsing approach that I didn't quite understood until now.
Earley parsing cannot handle right recursive grammars very well. To point it out I'm using a simple grammar to parse a short expression with Earley parser algorithm.
Here's the grammar:
exp -> expr '^' exp
exp -> expr
Here's the expression:
expr '^' expr '^' expr '^' expr
Earley parser creates 'sets'. This is the very first set it creates:
°expr '^' expr '^' expr '^' expr
0:
0: exp -> ° expr '^' exp
0: exp -> ° expr
It goes all good to the first add-expression:
expr '^' expr ° '^' expr '^' expr
3:
2: exp -> expr ° '^' exp
2: exp -> expr °
0: exp -> expr '^' exp °
Then the chart starts to grow:
expr '^' expr '^' expr ° '^' expr
5:
4: exp -> expr ° '^' exp
4: exp -> expr °
2: exp -> expr '^' exp °
0: exp -> expr '^' exp °
And the chart grows every time the right recursive rule is reduced:
expr '^' expr '^' expr '^' expr °
7:
6: exp -> expr ° '^' exp
6: exp -> expr °
4: exp -> expr '^' exp °
2: exp -> expr '^' exp °
0: exp -> expr '^' exp °
This means that Earley parser is slowing down at long sequences of right recursive parses like this. Currently this is true for my Lever parser as well.
Marpa doesn't have problems in parsing with right recursive grammars. It achieves this with Leo silos and items, named after Joop Leo who came up with a method to optimize right recursion on Earley parsing in 1991.
To implement this, you define leo-egilible earley items. When processing this kind of items we can do a trick and cache only the last result of repeated reductions. Here's a leo egilible item from the parsing example we just had:
4: exp -> expr '^' ° exp
This is leo egilible because...
- It has right-recursive rule.
- It is quasi-complete (the dot is one shift away from completion)
- It is postdot unique in its set.
The last rule means that this earley is in an earley set where the symbol it would shift with (° exp) is unique.
To see this is true, here's the earley set where that rule was:
expr '^' expr '^' expr '^' ° expr
6:
4: exp -> expr '^' ° exp
6: exp -> ° expr '^' exp
6: exp -> ° expr
If we shift from this earley set with 'exp', we will get only one result that is the:
4: exp -> expr '^' exp °
This allows us to compress the reductions in these items because they are repeating.
Here's the same parse as earlier, but with leo silos provided:
°expr '^' expr '^' expr '^' expr
0:
0: exp -> ° expr '^' exp
0: exp -> ° expr
expr ° '^' expr '^' expr '^' expr
1:
0: exp -> expr ° '^' exp
0: exp -> expr °
expr '^' ° expr '^' expr '^' expr
2:
0: exp -> expr '^' ° exp (L)
2: exp -> ° expr '^' exp
2: exp -> ° expr
expr '^' expr ° '^' expr '^' expr
3:
2: exp -> expr ° '^' exp
2: exp -> expr °
0: exp -> expr '^' exp °(L)
expr '^' expr '^' ° expr '^' expr
4:
2: exp -> expr '^' ° exp (L:1, starts at 2)
4: exp -> ° expr '^' exp
4: exp -> ° expr
expr '^' expr '^' expr ° '^' expr
5:
4: exp -> expr ° '^' exp
4: exp -> expr °
2: exp -> expr '^' exp °(L*)
0: exp -> expr '^' exp °(L)
expr '^' expr '^' expr '^' ° expr
6:
4: exp -> expr '^' ° exp (L)
6: exp -> ° expr '^' exp
6: exp -> ° expr
expr '^' expr '^' expr '^' expr °
7:
6: exp -> expr ° '^' exp
6: exp -> expr °
4: exp -> expr '^' exp °(L*)
2: exp -> expr '^' exp °(L*)
0: exp -> expr '^' exp °(L)
You can see that if you remove the reduction inferenced items (marked L*), the set 7 has not grown beyond what set 5 was. After this example I weren't yet entirely sure how it is going out there. The overall idea is that we can 'silo' and cut-off in a reduction path because we know where the reduction leads to.
Lets try with a different right-recursive example:
a -> x b
a -> x
b -> y a
The grammar makes up a language that is very much like the earlier one. But this one is not directly right-recursive. Here's the parse chart:
°x y x y x
0:
0: a -> ° x b
0: a -> ° x
x°y x y x
1:
0: a -> x ° b (L)
0: a -> x °
1: b -> ° y a
x y°x y x
2:
1: b -> y ° a (L)
2: a -> ° x b
2: a -> ° x
x y x°y x
3:
2: a -> x ° b (L)
2: a -> x °
3: b -> ° y a
1: b -> y a °(L*)
0: a -> x b °(L)
x y x y°x
4:
3: b -> y ° a (L)
4: a -> ° x b
4: a -> ° x
x y x y x°
5:
4: a -> x ° b
4: a -> x °
3: b -> y a °(L*)
2: a -> x b °(L*)
1: b -> y a °(L*)
0: a -> x b °(L)
The reduction path here is even more blatantly apparent. Also it is blatantly deterministic and redundant. If you know that 4: a -> x ° at set 5 led to deterministic reduction, you'll know how the story goes until it hits the 0.
There is two more interesting variations of the above case. Lets consider the following grammar:
a -> x a y
a -> x a
a -> x
And the parse:
°x
0:
0: a -> ° x a y
0: a -> ° x a
0: a -> ° x
x°
1:
0: a -> x ° a y
0: a -> x ° a
0: a -> x °
The rules in this example are not forming leo-egilible items anymore because the item cannot be postdot unique. This means that there is no longer a reduction path of determistic flavor.
Lets try yet one grammar:
a -> a y
a -> x a
a -> x
Parse:
°x x
0:
0: a -> ° a y
0: a -> ° x a
0: a -> ° x
x°x
1:
0: a -> x ° a (L)
0: a -> x °
0: a -> a ° y
1: a -> ° a y
1: a -> ° x a
1: a -> ° x
x x°
2:
1: a -> x ° a (L)
1: a -> x °
1: a -> a ° y
0: a -> x a °(L)
0: a -> a ° y
1: a -> a ° y
Parsing with this grammar gets complicated quickly, but the deterministic reduction rule here appears to keep working just fine.
The postdot-unique -restriction to leo-item-egilibility forces the algorithm to be correct. It singles out every case where the reduction becomes too complex to be memoized.
Loup Vaillant has written some tutorials on earley parsing.
If you're interested about Marpa overall, there's Jeffrey Kegler's Marpa parser website.