Fundamental piece of my compiler

This is an overview of earley/marpa parsing, a fundamental unit of most of my existing and upcoming compilers. I got introduced to this method of parsing through the good work of Jeffrey Kegler. I've written about this before, though now I've learned some more.

An earley parser takes a grammar and input tokens as an input, and produces a parse tree. It makes a great and often sufficient parser before you've got to put up with harder designs.

Rewriting nulls out of the grammar

Lot of effort in getting Earley algorithm to work can be averted by parsing with grammars that lack production rules capable of parsing an empty string. It ensures predictions cannot result in new completions.

Empty production rules can be isolated and rewritten out of the grammar. Production rules that contain nullable symbols are expanded by going through all non-empty permutations of the rule. An example:

C ->   ','
C ->   ',' B
C -> B ','  
C -> A ',' B

Notice how it resembles counting with binary numbers.

Earley items

An earley item consists of a dotted rule and origin.

A parser step

The parser is invoked for every token in the input. Every invocation results in a new earley set.

First a scan happens. The scan grabs every earley item that accepts the input token and advances them. The scan may produce earley items that were completed. Each completion grabs earley items further.

Once every scan and completion has been made, the prediction step can occur. The prediction produces new earley items.

SPPF -tree

If not taken care of early, the construction of parse tree from recognized earley sets easily take longer than the earley parsing itself takes.

The SPPF -tree is formed by producing links from new earley items into an item that were advanced and the item that it was advanced with.

Post-order tree traversal

If SPPF-tree is unambiguous, it is essentially a binary tree. You can post-order traverse it to evaluate the parse.

When evaluating, be sure to use explicit stack for storing right edges. Long parses will easily overflow any fixed-size stack.

Earlier I also had pre-order invocations. They turned out to be occassionally useful, but make the system in whole harder to understand.