Error correction in LR parsing

This time we dig into error correction in LR parsers. This is a blind dig: I browsed through several papers describing error repair algorithms until I found this:

"A Practical Minimum Distance Method for Syntax Error Handling"

J. A. Dain

It seems like a nice paper: it describes how to implement an error recovery that attempts to find the smallest repair that lets the parse continue.

I didn't know much about this algorithm beforehand, and I don't know how it performs in practice.

I get the impression that the algorithm presented in the paper can produce error reports such as these:

Line 5: syntax error
4 * ( 5 2 1 ) + 3 + 2
--------^    replace with '+'

Line 35: syntax error
            if number > 1
------------^
inserted ';'

Line 234: syntax error
procedure factorial (a);
----------------------^ insert ':'
-----------------------^ insert 'identifier'

Alternatively it is able to produce repaired printouts:

Line 10 replaced by 'while a[1] do begin'

Error correction allows the parser to continue through syntax errors.

Why pass through syntax errors? Mainly so that the parser can report every error in the input in a single run, instead of stopping at the first one.

Levenshtein edit distance

Before looking at the algorithm in the paper, let's look at a simpler, related algorithm.

Levenshtein distance is a number associated with two strings: the minimum count of insertions, removals and substitutions somebody would have to do in order to rewrite one string into the other.

The algorithm we are about to examine is based on the Wagner-Fischer algorithm, which is a dynamic programming algorithm.

'Dynamic programming' means that we memoize previously computed values and use them to compute new ones.

Here's the algorithm:

def editdistance(s, t):
    m = len(s)
    n = len(t)
    # Fill row i with the value i; this sets the first column
    # d[i][0] = i, and the rest gets overwritten below.
    d = [[i for j in range(n+1)] for i in range(m+1)]
    for j in range(1, n+1):
        d[0][j] = j
        for i in range(1, m+1):
            substitution_cost = int(s[i-1] != t[j-1])
            d[i][j] = min(d[i-1][j] + 1,                      # deletion
                          d[i][j-1] + 1,                      # insertion
                          d[i-1][j-1] + substitution_cost)    # substitution
    return d[m][n]
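
A quick check that it behaves, with the expected results in comments:

print(editdistance('wagner', 'fischer'))  # 5
print(editdistance('', 'fischer'))        # 7, insert every letter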

To illustrate how this works, let's find the edit distance between 'wagner' and 'fischer'.

Before entering the loop we've got this kind of cost matrix:

[0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1]
[2, 2, 2, 2, 2, 2, 2, 2]
[3, 3, 3, 3, 3, 3, 3, 3]
[4, 4, 4, 4, 4, 4, 4, 4]
[5, 5, 5, 5, 5, 5, 5, 5]
[6, 6, 6, 6, 6, 6, 6, 6]

Only the first column here means anything; the rest is garbage. To make sense of this, we can overlay the strings:

    f i s c h e r
  0 1 2 3 4 5 6 7
w 1
a 2
g 3
n 4
e 5
r 6

The starting point of the algorithm is that we know the edit distance when either string is empty: it is the length of the other string. That gives us the first row and the first column.

We have a way to compute the shortest edit whenever we have a corner like this:

    f
  0 1
w 1

There are three ways to calculate the edit distance for the missing cell.

  1. If letter 'w' was removed, take a cost of '1' plus the base cost from the cell above.
  2. If letter 'f' was inserted, take a cost of '1' plus the base cost from the cell on the left.
  3. If the letter was replaced, take a replacement cost of '1' (it would be '0' if the letters matched) plus the base cost from the upper left cell.

In the corner the cheapest choice is the third option, with a cost of '1', so we fill that in and get to look at the new cells that opened up:

    f i
  0 1 2
w 1 1
a 2

Doing this the simple way, like the Python script does, we continue filling each column downwards:

    f i s c h e r
  0 1 2 3 4 5 6 7
w 1 1
a 2 2
g 3 3
n 4 4
e 5 5
r 6 6

Eventually it finds out that the edit distance is 5.

    f i s c h e r
  0 1 2 3 4 5 6 7
w 1 1 2 3 4 5 6 7
a 2 2 2 3 4 5 6 7
g 3 3 3 3 4 5 6 7
n 4 4 4 4 4 5 6 7
e 5 5 5 5 5 5 5 6
r 6 6 6 6 6 6 6 5

You perhaps already figured out how to read this matrix. The edit distance between 'wagner' and 'fischer' is '5' because that is the value in the bottom right corner.

And now we can find the edits that produce this result:

    f i s c h e r
  0 1            
w   1 2          
a     2 3        
g       3 4      
n         4 5    
e             5  
r               5

There are several paths through here; let's pick a few.

add 'f'
replace 'wagn' with 'isch'

Here's another.

replace 'wa' with 'fi'
add 's'
replace 'gn' with 'ch'

And yet another.

replace 'wagn' with 'fisc'
add 'h'

All of these have a total Levenshtein edit cost of '5'.
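
Finding such a path can be automated. Below is a minimal sketch of a traceback; the function name edit_script is mine, and the tie-breaking (preferring diagonal moves) just picks one of the minimal scripts:

def edit_script(s, t):
    # Build the full cost matrix, exactly as in editdistance.
    m, n = len(s), len(t)
    d = [[i for j in range(n+1)] for i in range(m+1)]
    for j in range(1, n+1):
        d[0][j] = j
        for i in range(1, m+1):
            cost = int(s[i-1] != t[j-1])
            d[i][j] = min(d[i-1][j] + 1,
                          d[i][j-1] + 1,
                          d[i-1][j-1] + cost)
    # Walk back from the bottom right corner, recording which
    # of the three choices could have produced each cell.
    script = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + int(s[i-1] != t[j-1]):
            if s[i-1] != t[j-1]:
                script.append("replace %r with %r" % (s[i-1], t[j-1]))
            i, j = i-1, j-1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            script.append("remove %r" % s[i-1])
            i -= 1
        else:
            script.append("add %r" % t[j-1])
            j -= 1
    script.reverse()
    return script

print(edit_script('wagner', 'fischer'))
# ["add 'f'", "replace 'w' with 'i'", "replace 'a' with 's'",
#  "replace 'g' with 'c'", "replace 'n' with 'h'"]

This happens to be the first of the paths above, with the replacements listed letter by letter.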

Algorithm from the paper

The error-correcting parser described in the paper switches to recovery mode whenever parsing cannot progress.

The algorithm searches for a minimal repair sequence, much like the edit distance computation above. Ideally it would empty the incoming stream and search for the shortest repair among every valid way to continue the parse. This is not practical, however.

To make a practical method that follows that ideal, the algorithm places a limit on the length of the repair strings, and the candidates are measured against prefixes of a fixed number of incoming symbols. The longest repair with the lowest score is selected.

To understand it better, let's look at Table IV in the paper, which displays "continuation strings" and "minimum distances".

      │  id id ) + id +
──────┼────────────────
+ ( id│3 2  2  3 4 4  5
+ ( ( │3 3  3  3 4 5  5
) $   │2 2  2  3 3 4  5
) * id│3 2  2  3 4 3  4
) * ( │3 3  3  3 4 4  5
+ id *│3 2  2  2 3 4  4
) + id│3 2  2  3 3 2  3
) + ( │3 3  3  3 3 3  4
+ id )│3 2  2  1 2 3  4
+ id +│3 2  2  2 2 3  3

The left column contains repair string candidates. The columns on the right give the edit distance between each candidate and each prefix of the input id id ) + id +, starting from the empty prefix.

The repair + id ) has the best score of '1', repairing the input prefix id id ). If that repair sequence had not been found, there would have been many candidates with a score of '2'.

I'd have likely picked ) + id for the input id id ) + id, as that made the longest progress. Though in retrospect, it seems the shorter repair sequences produced better results because they were more likely to insert missing semicolons.
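
Here's a minimal sketch of what this scoring step could look like, reusing editdistance from above. The function name, the data representation (token lists) and the exact tie-breaking are my own assumptions rather than the paper's:

def best_repair(continuations, lookahead):
    # Score every continuation string against every prefix of the
    # lookahead window, as in Table IV. The lowest distance wins;
    # ties go to the longer repair string.
    best = None
    best_key = None
    for repair in continuations:
        for k in range(len(lookahead) + 1):
            score = editdistance(lookahead[:k], repair)
            key = (score, -len(repair))
            if best_key is None or key < best_key:
                best = (repair, lookahead[:k], score)
                best_key = key
    return best  # (repair, input prefix it replaces, distance)

This only selects a candidate; it doesn't yet tell which edits to report.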

How do you get the repair sequence from here? By tracing back through the distance matrix, the same way as before:

      +  id )
   0  1  2  3
id 1  1  1  2
id 2  2  1  2
)  3  3  2  1

Now you see what the repair is.

Replace 'id' with '+'.

Let's do another.

      )  +  id
   0  1  2  3
id 1  1  2  2    
id 2  2  2  2
)  3  2  3  3
+  4  3  2  3
id 5  4  3  2

There's a path that tells what to do.

Remove 'id',
Remove 'id'.
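
For what it's worth, the edit_script sketch from the Levenshtein section recovers both of these repairs when fed the token lists:

print(edit_script(['id', 'id', ')'], ['+', 'id', ')']))
# ["replace 'id' with '+'"]
print(edit_script(['id', 'id', ')', '+', 'id'], [')', '+', 'id']))
# ["remove 'id'", "remove 'id'"]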

It's a bit unfortunate that I couldn't find the Ripley-Druseikis collection that the paper used to test the algorithm.

Let's try it!

Ok, let's build one for Ratstail91/Toy. This is the grammar that motivated me to look into LR parsing again. A few weeks ago the author came to r/programminglanguages and asked for a review of his language.

You can examine the parser in this gist (and it's attached to this post):

https://gist.github.com/cheery/c6076315311c28db5a6d0edfae13e5bb

error-correcting-parser.zip

For this test input (I only took the broken lines):

16 va dict2 = Dictionary();
23 dict2.Insert("six", nothueisth 6);
26 dict2.Insert("nine", );

It's able to give the following error messages:

syntax error at line 16: inserted ";" to col 4
syntax error at line 16: inserted "var" to col 4
syntax error at line 23: inserted "," to col 32
syntax error at line 26: inserted STRING to col 22

This error repair algorithm would need something to accompany it: if none of the bounded repair candidates lets the parser continue from its position, it finds no repair sequence at all.

Also, the error correction was complex enough as it is, so I didn't extend it to work through layout sensitivity.

Conclusion

I kind of feel that I am addicted to the idea, but what I've seen so far was not satisfying enough.

While writing this post, I saw a paper with more error recovery algorithms:

"Don't Panic! Better, Fewer, Syntax Errors for LR Parsers"

Lukas Diekmann, Laurence Tratt

I'll look into the algorithms presented in this paper some other time. Also, if you know of a paper about error correction algorithms in parsers, I'd like to hear about it.
