Error correcting Earley parsing
In this post I explain how Alfred V. Aho's and Thomas G. Peterson's error-correcting algorithm works.
"A Minimum-Distance Error-Correcting Parser for Context-Free Languages" was published in the SIAM Journal on Computing in 1972. It introduces a method to parse any string to completion with the fewest possible number of errors, measured by Levenshtein distance.
I also present a potential way to speed it up by pruning results.
Context-free grammars
A context-free grammar is a description of a language. It consists of production rules, and each production rule has a left side and a right side.
A production rule is written like this:
element → LB block RB
This means that "element" can be expanded into (LB block RB).
Now suppose that the same grammar has a production rule for 'block', for example:
block → TEXT
Now you can compose these rules and you get:
element ⇒ LB TEXT RB
If "element" is a start symbol for the grammar, then this grammar recognizes a string "LB TEXT RB".
A context-free grammar may also be presented as a so-called 4-tuple, consisting of:
- the set of nonterminals used
- the set of terminals used
- the set of production rules belonging to the grammar
- and a start symbol
In Python you could represent a CFG like this:
# Each production rule is an (lhs, rhs) pair. The rule whose lhs is None
# marks the start symbol of the grammar.
grammar = [
    (None, ['element']),
    ('element', ['TEXT']),
    ('element', ['LB', 'block']),
    ('element', ['LP', 'attribute']),
    ('element', ['AT', 'location']),
    ('block', ['RB']),
    ('block', ['element', 'block']),
    ('attribute', ['RP']),
    ('attribute', ['TEXT', 'RP']),
    ('attribute', ['SEP', 'attribute']),
    ('attribute', ['TEXT', 'SEP', 'attribute']),
    ('location', ['TA']),
    ('location', ['LP', 'attribute', 'location']),
    ('location', ['TEXT', 'location']),
    ('location', ['SEP', 'location']),
]
# The remaining symbols are the terminals.
terminals = ['AT', 'LB', 'LP', 'RB', 'RP', 'SEP', 'TA', 'TEXT']
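The 4-tuple view can be recovered from this representation. Here's a minimal sketch; the helper variables are my own, not part of any library:

# Derive the remaining pieces of the 4-tuple from the rule list above.
nonterminals = {lhs for lhs, rhs in grammar if lhs is not None}
start_symbol = next(rhs[0] for lhs, rhs in grammar if lhs is None)
# `grammar` itself is the set of production rules, and `terminals` lists the terminals.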
Now you understand context-free grammars if you didn't already.
Items
An item records progress within a production rule: it splits the rule into three parts, the left side, the portion of the right side that has been parsed, and the portion that has not. Here are examples of items:
element → ∘ LB block RB
element → LB ∘ block RB
element → LB block ∘ RB
element → LB block RB ∘
block → ∘ TEXT
block → TEXT ∘
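One possible way to represent an item in code, following the Python representation from earlier, is a tuple of a rule index, a dot position, and an origin itemset index. This is just a sketch of one encoding, not anything from the paper:

from collections import namedtuple

# rule: index into the grammar list, dot: how many right-hand-side symbols
# have been recognized so far, origin: the itemset index where the item started.
Item = namedtuple('Item', ['rule', 'dot', 'origin'])

# For example, element → LB ∘ block RB started at itemset 0 could be written
# Item(rule=2, dot=1, origin=0), if rule 2 were element → LB block RB.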
To illustrate how this works, let's start with an example.
Simple example of Earley parsing
Let's take the simple grammar from earlier:
element → LB block RB
block → TEXT
The start symbol is element. Now we start parsing the only string this grammar recognizes, LB TEXT RB.
We start from the left and mark it down with a dot:
∘ LB TEXT RB
∘ element
We also mark that we're parsing the element here.
The parsing cannot progress until we expand the element. This is done by writing an item that predicts the element.
∘ LB TEXT RB
∘ element
element → ∘ LB block RB
Now there's only one way this can progress, so we can step it forward.
LB ∘ TEXT RB
element → LB ∘ block RB (0)
Note that we had to mark into the item where it started. It started before we had parsed anything, so we mark it with 0.
block cannot match TEXT directly, so another prediction must be done.
LB ∘ TEXT RB
element → LB ∘ block RB (0)
block → ∘ TEXT
Now we can progress again.
LB TEXT ∘ RB
block → TEXT ∘ (1)
Because the dot has reached the end of the rule, we've found a block in the text.
Let's index the itemsets so far:
[0] ∘ LB TEXT RB
∘ element
element → ∘ LB block RB
[1] LB ∘ TEXT RB
element → LB ∘ block RB (0)
block → ∘ TEXT
[2] LB TEXT ∘ RB
block → TEXT ∘ (1)
We have [2]: block → TEXT ∘ (1) and [1]: element → LB ∘ block RB (0). This means that we can add [2]: element → LB block ∘ RB (0).
[2] LB TEXT ∘ RB
block → TEXT ∘ (1)
element → LB block ∘ RB (0)
And we can progress again.
[3] LB TEXT RB ∘
element → LB block RB ∘ (0)
element ∘
The element has a reduction that spans the whole text.
Precise description of Earley parsing
Earley parsing can be described as repeated application of these rules until no new items can be added:
Prediction rule. If you have an item [k]: S → x ∘ A y (n), and a rule A → z, add the item [k]: A → ∘ z (k).
Reduction rule. If you have an item [k]: A → x ∘ (j), and an item [j]: C → y ∘ A z (n), add the item [k]: C → y A ∘ z (n).
Scanning rule. If you can scan with terminal A at index [k], and you have an item [k]: C → x ∘ A y (n), add the item [k+1]: C → x A ∘ y (n).
In these rules, the lowercase letters stand for sequences of zero or more terminals/nonterminals.
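These three rules translate fairly directly into code. Below is a minimal recognizer sketch over the Python rule list from earlier; the function name and the (rule_index, dot, origin) item encoding are my own, and it only reports whether the input is accepted:

# A minimal Earley recognizer over the Python rule list from earlier.
# Items are (rule_index, dot, origin) tuples. This is my own sketch,
# not code from the paper.
def parse(grammar, terminals, tokens):
    n = len(tokens)
    itemsets = [set() for _ in range(n + 1)]
    # Seed itemset 0 with the start rule (the one whose lhs is None).
    for i, (lhs, rhs) in enumerate(grammar):
        if lhs is None:
            itemsets[0].add((i, 0, 0))
    for k in range(n + 1):
        changed = True
        while changed:          # repeat until no new items appear
            changed = False
            for i, dot, origin in list(itemsets[k]):
                lhs, rhs = grammar[i]
                if dot < len(rhs) and rhs[dot] not in terminals:
                    # Prediction rule: expand the nonterminal after the dot.
                    for j, (lhs2, _) in enumerate(grammar):
                        if lhs2 == rhs[dot] and (j, 0, k) not in itemsets[k]:
                            itemsets[k].add((j, 0, k))
                            changed = True
                elif dot == len(rhs):
                    # Reduction rule: the dot reached the end of the rule.
                    for j, dot2, origin2 in list(itemsets[origin]):
                        lhs2, rhs2 = grammar[j]
                        if dot2 < len(rhs2) and rhs2[dot2] == lhs:
                            if (j, dot2 + 1, origin2) not in itemsets[k]:
                                itemsets[k].add((j, dot2 + 1, origin2))
                                changed = True
        if k < n:
            # Scanning rule: advance items expecting the next input terminal.
            for i, dot, origin in itemsets[k]:
                lhs, rhs = grammar[i]
                if dot < len(rhs) and rhs[dot] == tokens[k]:
                    itemsets[k + 1].add((i, dot + 1, origin))
    # Accept if a completed start rule spans the whole input.
    return any(grammar[i][0] is None and dot == len(grammar[i][1]) and origin == 0
               for i, dot, origin in itemsets[n])

For the grammar defined earlier, parse(grammar, terminals, ['LB', 'TEXT', 'RB']) should return True.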
Levenshtein-distance-weighted context-free grammar
The paper describes a transformation of the grammar. The transformed grammar matches any string, while tracking how many edits would turn that string into one the original grammar matches.
First we introduce a new nonterminal for every terminal in the original grammar.
The new grammar copies the production rules from the original grammar and replaces every terminal symbol occurring in them with the corresponding nonterminal. It also adds a few rules.
The starting symbol is replaced by a new starting symbol, and the following rules are added for it:
new_start → old_start
new_start → old_start .+
{1} .+ → .+ .
{1} .+ → .
A few concepts are introduced here that are not part of plain context-free grammars.
The first is that production rules may be weighted. The last two rules above have a weight; they correspond to discarding symbols from the end of the input string.
The lone dot "." is a catch-all terminal: we are allowed to scan with it on any input symbol.
Finally there is "!terminal", which matches every terminal except the one that follows the ! symbol.
Now for every terminal in the original grammar, the following new production rules are introduced:
<terminal> → terminal
<terminal> → .+ terminal
{1} <terminal> →
{1} <terminal> → !terminal
The second rule captures discarding extra symbols before this terminal matches. The third rule captures insertion of the terminal: it doesn't appear in the input and has to be added. The fourth rule captures replacement of the terminal with something else.
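Here's a sketch of that transformation over the Python rule list representation. The encoding is my own, not the paper's: weighted rules become (weight, lhs, rhs) triples, '.' stands for the catch-all terminal, and ('!', t) for a negated terminal:

# A sketch of the grammar transformation, using a hypothetical encoding:
# weighted rules are (weight, lhs, rhs), '.' is the catch-all terminal and
# ('!', t) means "any terminal except t".
def error_correcting_grammar(grammar, terminals):
    def wrap(symbol):
        # Terminals are replaced with their corresponding new nonterminals.
        return '<%s>' % symbol if symbol in terminals else symbol
    new_rules = []
    # Copy the original rules, replacing terminals with the new nonterminals.
    for lhs, rhs in grammar:
        if lhs is None:
            continue  # the old start rule is replaced below
        new_rules.append((0, lhs, [wrap(sym) for sym in rhs]))
    # New start symbol: match the old start, optionally followed by junk.
    old_start = next(rhs[0] for lhs, rhs in grammar if lhs is None)
    new_rules.append((0, None, [old_start]))
    new_rules.append((0, None, [old_start, '.+']))
    new_rules.append((1, '.+', ['.+', '.']))
    new_rules.append((1, '.+', ['.']))
    # Four rules per terminal: exact match, junk before it, missing, replaced.
    for t in terminals:
        new_rules.append((0, '<%s>' % t, [t]))
        new_rules.append((0, '<%s>' % t, ['.+', t]))
        new_rules.append((1, '<%s>' % t, []))
        new_rules.append((1, '<%s>' % t, [('!', t)]))
    return new_rules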
Okay, next we can get to an example. Let's rewrite the following grammar:
element → LB block RB
block → TEXT
We get the following new grammar:
element → <LB> block <RB>
block → <TEXT>
start → element
start → element .+
{1} .+ → .+ .
{1} .+ → .
<LB> → LB
<LB> → .+ LB
{1} <LB> →
{1} <LB> → !LB
<RB> → RB
<RB> → .+ RB
{1} <RB> →
{1} <RB> → !RB
<TEXT> → TEXT
<TEXT> → .+ TEXT
{1} <TEXT> →
{1} <TEXT> → !TEXT
Now we can try parsing again. Let's do the same as earlier.
Simple example of weighted Earley parsing
The first step repeated with the earlier rules yields this:
[0] ∘ LB TEXT RB
∘ start
start → ∘ element
start → ∘ element .+
element → ∘ <LB> block <RB>
<LB> → ∘ LB
<LB> → ∘ .+ LB
{1} <LB> → ∘
{1} <LB> → ∘ !LB
{1} .+ → ∘ .+ .
{1} .+ → ∘ .
{1} element → <LB> ∘ block <RB>
block → ∘ <TEXT>
<TEXT> → ∘ TEXT
<TEXT> → ∘ .+ TEXT
{1} <TEXT> → ∘
{1} <TEXT> → ∘ !TEXT
{1} block → <TEXT> ∘
{2} element → <LB> block ∘ <RB>
<RB> → ∘ RB
<RB> → ∘ .+ RB
{1} <RB> → ∘
{1} <RB> → ∘ !RB
{3} element → <LB> block <RB> ∘
{3} start → element ∘
{3} start ∘
But now there's a bit more going on. First of all, there is already an accepting item with a cost of 3: the cost of inserting all three missing terminals. Let's scan the next terminal.
[1] LB ∘ TEXT RB
<LB> → LB ∘ (0)
element → <LB> ∘ block <RB> (0)
block → ∘ <TEXT>
<TEXT> → ∘ TEXT
<TEXT> → ∘ .+ TEXT
{1} <TEXT> → ∘
{1} <TEXT> → ∘ !TEXT
{1} .+ → ∘ .+ .
{1} .+ → ∘ .
{1} .+ → . ∘ (0)
{1} <TEXT> → !TEXT ∘ (0)
{1} <RB> → !RB ∘ (0)
{1} .+ → .+ ∘ . (0)
{1} <LB> → .+ ∘ LB (0)
{1} <RB> → .+ ∘ RB (0)
{1} <TEXT> → .+ ∘ TEXT (0)
{1} block → <TEXT> ∘
{1} element → <LB> block ∘ <RB> (0)
<RB> → ∘ RB
<RB> → ∘ .+ RB
{1} <RB> → ∘
{1} <RB> → ∘ !RB
{2} element → <LB> block <RB> ∘ (0)
{2} start → element ∘ (0)
{2} start ∘ (0)
In the worst case the Earley algorithm takes time proportional to n³, where n is the length of the input string. E.g. if it takes a second to parse 10 symbols, then you spend about 16 minutes on 100 symbols.
We follow fairly simple rules here, and all of them are applied in the above examples, so I leave it there.
I didn't enumerate which rules were applied, but they're the same as earlier, with rule costs taken into account during reduction.
Minimum-cost "weighted" Earley parsing
Weighted Earley parsing adds some additional rules to ordinary Earley parsing.
In the prediction rule, if there's a cost on the production rule, the produced item gets that cost.
In the reduction rule, the produced item's cost is the sum of the two combined items' costs.
In the scanning rule the item cost is carried over; items with "." match any terminal, and negated terminals match anything except the named terminal.
Lower-cost items replace higher-cost ones. Alternatively, lower-cost items are processed first, the way Dijkstra's algorithm works.
Precisely, the new rules are:
Prediction rule. If you have an item [k]: {a} S → x ∘ A y (n), and a rule {c} A → z, add the item [k]: {c} A → ∘ z (k).
Reduction rule. If you have an item [k]: {a} A → x ∘ (j), and an item [j]: {b} C → y ∘ A z (n), add the item [k]: {a+b} C → y A ∘ z (n).
Scanning rule. If you can scan with terminal A at index [k], and you have an item [k]: {a} C → x ∘ A y (n), add the item [k+1]: {a} C → x A ∘ y (n).
Scanning rule (Any). If you can scan with terminal A at index [k], and you have an item [k]: {a} C → x ∘ . y (n), add the item [k+1]: {a} C → x . ∘ y (n).
Scanning rule (Inverted). If you can scan with terminal A at index [k], and you have an item [k]: {a} C → x ∘ !B y (n), and A is not B, add the item [k+1]: {a} C → x !B ∘ y (n).
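As a sketch of how the three scanning variants might look in code, continuing with the hypothetical encodings from the earlier sketches, items can be kept in a dict mapping each item to the lowest cost found for it so far:

# A sketch of only the scanning step, with its three variants. Items are kept
# in a dict mapping (rule_index, dot, origin) to the lowest cost seen so far,
# and rules use the (weight, lhs, rhs) encoding from the earlier sketch.
def scan(rules, itemset, token):
    nextset = {}
    for (i, dot, origin), cost in itemset.items():
        rhs = rules[i][2]
        if dot == len(rhs):
            continue                    # nothing left to scan in this item
        expect = rhs[dot]
        matches = (expect == token      # plain scanning rule
                   or expect == '.'     # catch-all scanning rule (Any)
                   or (isinstance(expect, tuple) and expect[0] == '!'
                       and expect[1] != token))  # inverted scanning rule
        if matches:
            key = (i, dot + 1, origin)
            # The cost is carried over; keep only the lowest cost per item.
            if key not in nextset or cost < nextset[key]:
                nextset[key] = cost
    return nextset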
Some observations
It is interesting that this algorithm appears to be not much worse than Earley, in the sense that it only brings out the worst-case behavior that was already present.
With a short input even unmodified Earley can be useful, but in most cases this algorithm exposes the worst of Earley parsing.
There's one interesting observation. When I keep running it on random input, I find it building groups like this:
block → element ∘ block (0) cost=80
block → element ∘ block (1) cost=80
block → element ∘ block (2) cost=79
block → element ∘ block (3) cost=79
block → element ∘ block (4) cost=79
block → element ∘ block (5) cost=78
block → element ∘ block (6) cost=77
block → element ∘ block (7) cost=77
block → element ∘ block (8) cost=77
block → element ∘ block (9) cost=77
block → element ∘ block (10) cost=77
block → element ∘ block (11) cost=76
block → element ∘ block (12) cost=76
block → element ∘ block (13) cost=75
...
This isn't caused by the right recursion, though. If I replace it with left recursion I get this:
block → element* ∘ <RB> (0) cost=67
element* → element* ∘ element (0) cost=67
block → element* ∘ <RB> (1) cost=66
element* → element* ∘ element (1) cost=66
block → element* ∘ <RB> (2) cost=66
element* → element* ∘ element (2) cost=66
block → element* ∘ <RB> (3) cost=65
element* → element* ∘ element (3) cost=65
block → element* ∘ <RB> (4) cost=66
element* → element* ∘ element (4) cost=66
block → element* ∘ <RB> (5) cost=65
element* → element* ∘ element (5) cost=65
...
If you limit the amount of error you accept, the algorithm gets faster. However, if you then give it input that is closer to what's expected but still contains errors, the performance is about the same as before.
Limiting the amount of edit-distance errors means the parser no longer accepts every input, and we want to see how far the input is from being syntactically valid.
The item groups give a clue about how to improve the algorithm.
Pruning of items to speed it up
Once an itemset is completed, it can be pruned down by the cost accumulated on each item. We can define a branching limit, a constant number chosen alongside the grammar.
The prediction step in the algorithm can be made to store a cost for the nonterminal being predicted. This number estimates the amount of error in the accepting state, assuming the expression is completed without further errors.
[k]: {c} LHS → a ∘ A b (j)
totalcost = rcost(LHS, j) + c
Mathematically, rcost is the minimum sum obtained by following the items that predicted LHS:
[j]: {d} X → x ∘ LHS y (m)
rcost(LHS, j) = min(d + rcost(X, m))
...
rcost(Accept, 0) = 0
Now we can take the totalcost and use it to limit the number of items that are advanced as the result of scanning A. If there are more than the branching limit, drop the highest-cost branches until the limit is reached.
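A rough sketch of this pruning step, reusing the hypothetical item encoding from the earlier sketches and assuming an rcost function as described above:

# A sketch of the pruning idea: keep only the `limit` items with the lowest
# estimated total cost before advancing them past a scan. `rcost` is assumed
# to be a function implementing the recursion described above.
def prune(rules, itemset, rcost, limit):
    def totalcost(entry):
        (i, dot, origin), cost = entry
        lhs = rules[i][1]
        # Cost accumulated so far plus the cost of the predicting context.
        return rcost(lhs, origin) + cost
    # Keep only the `limit` lowest-cost items; the rest are dropped.
    return dict(sorted(itemset.items(), key=totalcost)[:limit])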
On a test input of 250 randomized symbols, parsing without pruning takes 1:50 minutes.
If I limit the number of items passing a scan to 6, it takes 3 seconds and accepts the input with the same edit-distance cost as the original algorithm.
If I limit it to 3, the result is 1.64 times off from the optimal, but it only takes 0.9 seconds to compute. 4 items again takes 3 seconds.
The best number of items to leave after pruning varies with the input, though in every case we seem to obtain close approximations of the edit distance.
This algorithm is not linear-time, though. I see it slowing down as the input gets larger, and this time it appears to be due to the right recursion in the grammar.
Conclusion
The idea of speeding this algorithm up seems promising, but it probably needs further study, if only to determine whether it is doing anything different from what local error correction would do.