# Relational Parsing, part 0, recognizer

Examining relational parsing, a new general context-free parsing algorithm.

I will write a recognizer in Haskell and iterate through the design. I'm not certain I got it correct, so please examine the paper for details. They've done a good job explaining it, but I still had to guess some details.

Throughout the session, I will use the same grammar described here.

```
Elem0 -> Elem1
Elem0 -> Elem1 Plus Elem0
Elem1 -> Elem2
Elem1 -> Elem2 Mul Elem1
Elem2 -> Term
Elem2 -> LeftP Elem0 RightP
```

IMPORTANT! Left-recursive grammars require extra attention.

Our final module is going to use sets, maps and state.

```haskell
module Recognizer where

import qualified Data.Set as Set
import qualified Data.Map as Map
import Control.Monad.State.Lazy
```

Symbols need to be described in Haskell:

```haskell
data S = Elem0 | Elem1 | Elem2 | Term | Plus | Mul | LeftP | RightP
  deriving (Show, Eq, Ord)

type Token = S
```

The key concept of the algorithm is the left-linear closure. It consists of the derivations that are left-linear: every word in it starts with a terminal symbol. We can define it for our grammar as follows:

```haskell
left_linear_closure :: S -> [[S]]
left_linear_closure Elem0 =
  [ [Term]
  , [LeftP, Elem0, RightP]
  , [Term, Mul, Elem1]
  , [LeftP, Elem0, RightP, Mul, Elem1]
  , [Term, Plus, Elem0]
  , [LeftP, Elem0, RightP, Plus, Elem0]
  , [Term, Mul, Elem1, Plus, Elem0]
  , [LeftP, Elem0, RightP, Mul, Elem1, Plus, Elem0] ]
left_linear_closure Elem1 =
  [ [Term]
  , [LeftP, Elem0, RightP]
  , [Term, Mul, Elem1]
  , [LeftP, Elem0, RightP, Mul, Elem1] ]
left_linear_closure Elem2 =
  [ [Term]
  , [LeftP, Elem0, RightP] ]
left_linear_closure Term   = [[Term]]
left_linear_closure Plus   = [[Plus]]
left_linear_closure Mul    = [[Mul]]
left_linear_closure LeftP  = [[LeftP]]
left_linear_closure RightP = [[RightP]]
```
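Writing the closure out by hand is error-prone. For grammars without left recursion it can also be computed mechanically, by rewriting the leftmost symbol until every word starts with a terminal. This is my own sketch, not code from the paper; `rules`, `terminal` and `closure` are names I made up:

```haskell
data S = Elem0 | Elem1 | Elem2 | Term | Plus | Mul | LeftP | RightP
  deriving (Show, Eq, Ord)

-- the grammar from the top of the article
rules :: S -> [[S]]
rules Elem0 = [[Elem1], [Elem1, Plus, Elem0]]
rules Elem1 = [[Elem2], [Elem2, Mul, Elem1]]
rules Elem2 = [[Term], [LeftP, Elem0, RightP]]
rules _     = []            -- terminals have no productions

terminal :: S -> Bool
terminal = null . rules

-- expand the leftmost symbol until it is a terminal; this loops
-- forever on left-recursive grammars, matching the warning above
closure :: S -> [[S]]
closure s = expand [s]
  where
    expand (x:rest)
      | terminal x = [x : rest]
      | otherwise  = concat [expand (alt ++ rest) | alt <- rules x]
    expand [] = [[]]
```

`closure Elem0` produces the same eight words as the hand-written `left_linear_closure Elem0`.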

The paper combines the left-linear closure with a derivation step. It takes a set of words and produces another language, with the given symbol chopped off the front of every word that starts with it.

```haskell
derive :: [[S]] -> S -> [[S]]
derive xs s = foldl step [] xs
  where
    step :: [[S]] -> [S] -> [[S]]
    step xxs (x:ys) | x == s = ys : xxs
    step xxs _               = xxs
```
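To see `derive` in isolation, here is the same fold with the symbol type left polymorphic, so it can be tried on plain characters (a standalone snippet, not part of the module):

```haskell
-- keep the words that start with the given symbol, chopped; note that
-- building the result with foldl and (:) also reverses their order
derive :: Eq a => [[a]] -> a -> [[a]]
derive xs s = foldl step [] xs
  where
    step xxs (x:ys) | x == s = ys : xxs
    step xxs _               = xxs
```

For example, `derive ["abc", "bcd", "ab"] 'a'` yields `["b", "bc"]`.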

We define a starting language; this is the state where nothing has been parsed yet.

```haskell
start :: [[S]]
start = [[Elem0]]
```

And we need to list the symbols, because we iterate through them in each parsing step. `Elem2` can be left out: it never appears at the front of a word in any left-linear closure.

```haskell
symbols :: [S]
symbols = [Elem0, Elem1, Term, Plus, Mul, LeftP, RightP]
```

For a start, we're going to use an inefficient language model: the language is simply a set of words. This is just to show off the principle by which the algorithm works.

```haskell
data Lang = Lang {unlang :: [[S]]} deriving (Show)
```

The recognizer algorithm is exactly as in the paper.

```haskell
recognize :: [Token] -> Bool
recognize ts = epsilon (foldl step init ts)
  where
    init :: Lang
    init = prepend start root

    step :: Lang -> Token -> Lang
    step lang t = foldl (union_each (rstep lang t)) empty symbols

    union_each :: (S -> Lang) -> Lang -> S -> Lang
    union_each f lang' s = union lang' (f s)

    rstep :: Lang -> Token -> S -> Lang
    rstep lang t s = prepend
      (left_linear_closure s `derive` t)
      (derivative lang s)
```

`rstep` captures the essence of the algorithm: for each symbol, we chop off a nonterminal or terminal and replace it with its left-linear closure, derived by the given terminal symbol.

A stub implementation of the language model follows:

```haskell
root :: Lang
root = Lang [[]]

prepend :: [[S]] -> Lang -> Lang
prepend l (Lang lang) = Lang $ do
  h <- l
  t <- lang
  pure (h <> t)

derivative :: Lang -> S -> Lang
derivative (Lang lang) s = Lang (derive lang s)

empty :: Lang
empty = Lang []

union :: Lang -> Lang -> Lang
union (Lang a) (Lang b) = Lang (a <> b)

epsilon :: Lang -> Bool
epsilon (Lang a) = any (== []) a
```

Finally, let's test that it recognizes a string from our language.

```haskell
test :: Bool
test = recognize [LeftP, Term, Plus, Term, RightP]
```
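For convenience, here is this first version assembled into one standalone file, together with a few more inputs I checked by hand. The terminal cases of the closure are collapsed into a single catch-all equation and `union_each` is inlined as a lambda; otherwise the code is as above.

```haskell
data S = Elem0 | Elem1 | Elem2 | Term | Plus | Mul | LeftP | RightP
  deriving (Show, Eq, Ord)

type Token = S

left_linear_closure :: S -> [[S]]
left_linear_closure Elem0 =
  [ [Term], [LeftP, Elem0, RightP]
  , [Term, Mul, Elem1], [LeftP, Elem0, RightP, Mul, Elem1]
  , [Term, Plus, Elem0], [LeftP, Elem0, RightP, Plus, Elem0]
  , [Term, Mul, Elem1, Plus, Elem0]
  , [LeftP, Elem0, RightP, Mul, Elem1, Plus, Elem0] ]
left_linear_closure Elem1 =
  [ [Term], [LeftP, Elem0, RightP]
  , [Term, Mul, Elem1], [LeftP, Elem0, RightP, Mul, Elem1] ]
left_linear_closure Elem2 = [[Term], [LeftP, Elem0, RightP]]
left_linear_closure s = [[s]]           -- terminals

derive :: [[S]] -> S -> [[S]]
derive xs s = foldl step [] xs
  where step xxs (x:ys) | x == s = ys : xxs
        step xxs _               = xxs

start :: [[S]]
start = [[Elem0]]

symbols :: [S]
symbols = [Elem0, Elem1, Term, Plus, Mul, LeftP, RightP]

data Lang = Lang { unlang :: [[S]] } deriving (Show)

recognize :: [Token] -> Bool
recognize ts = epsilon (foldl step (prepend start root) ts)
  where
    step lang t = foldl (\acc s -> union acc (rstep lang t s)) empty symbols
    rstep lang t s =
      prepend (left_linear_closure s `derive` t) (derivative lang s)

root, empty :: Lang
root  = Lang [[]]
empty = Lang []

prepend :: [[S]] -> Lang -> Lang
prepend l (Lang lang) = Lang [h <> t | h <- l, t <- lang]

derivative :: Lang -> S -> Lang
derivative (Lang lang) s = Lang (derive lang s)

union :: Lang -> Lang -> Lang
union (Lang a) (Lang b) = Lang (a <> b)

epsilon :: Lang -> Bool
epsilon (Lang a) = any null a

-- inputs and expected results, checked by hand against the grammar
checks :: [([Token], Bool)]
checks =
  [ ([Term], True)
  , ([Term, Plus, Term], True)
  , ([LeftP, Term, Plus, Term, RightP], True)
  , ([LeftP, Term, Plus, Term, RightP, Mul, Term], True)
  , ([Term, Plus], False)
  , ([], False) ]
```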

This concludes the outline of the recognizer algorithm. Next we're going to refine it.

## Splitting the language model

If we look at the simple language model, the most expensive part is the `prepend` step.

We shall approach the shape described in the paper.

```haskell
data Lang = Lang {unlang :: Set.Set ([S], Lang), epsilon :: Bool}
  deriving (Show, Ord, Eq)
```

`epsilon` records whether the language contains the empty word; if it does at the end of the input, the string is recognized. The recognizer stays intact, but our language model changes.

```haskell
root :: Lang
root = Lang Set.empty True

prepend :: [[S]] -> Lang -> Lang
prepend seq lang = foldl union empty (fmap step seq)
  where
    step :: [S] -> Lang
    step [] = lang
    step xs = Lang (Set.singleton (xs, lang)) False

derivative :: Lang -> S -> Lang
derivative (Lang slang _) s = foldl union empty (Set.map step slang)
  where
    step :: ([S], Lang) -> Lang
    step ([x], lang)    | x == s = lang
    step ((x:xs), lang) | x == s = Lang (Set.singleton (xs, lang)) False
    step _ = empty

empty :: Lang
empty = Lang Set.empty False

union :: Lang -> Lang -> Lang
union (Lang a e1) (Lang b e2) = Lang (a <> b) (e1 || e2)
```

## Inserting reference numbers

The model forms a DAG where a lot of structure is shared, and once structures are created they do not change. We'd like to exploit this by numbering our language models. For that, we're going to create labeled references.

```haskell
data LangRef = LangRef {langid :: Int, unref :: Lang} deriving (Show)

instance Eq LangRef where
  (==) a b = langid a == langid b

instance Ord LangRef where
  (<=) a b = langid a <= langid b

data Lang = Lang {unlang :: Set.Set ([S], LangRef), epsilon :: Bool}
  deriving (Show)

type LangContext a = State Int a

introduce :: Lang -> LangContext LangRef
introduce lang = do
  i <- get
  put (i + 1)
  pure (LangRef i lang)
```
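The numbering can be sanity-checked in isolation. A standalone sketch with the payload type left abstract (the `Ref` name is mine, not from the article):

```haskell
import Control.Monad.State.Lazy

-- same shape as `introduce` above: hand out the current counter
-- value as the label and bump it
data Ref a = Ref { refId :: Int, payload :: a } deriving (Show)

introduce :: a -> State Int (Ref a)
introduce x = do
  i <- get
  put (i + 1)
  pure (Ref i x)
```

`runState (mapM introduce "abc") 0` numbers the payloads `0`, `1`, `2` and leaves the counter at `3`.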

The recognizer needs to be changed to introduce the state for numbering references.

```haskell
recognize :: [Token] -> Bool
recognize ts = epsilon $ fst (runState (recognizeI ts) 0)

recognizeI :: [Token] -> LangContext Lang
recognizeI ts = do
    k <- introduce root
    foldM step (prepend start k) ts
  where
    step :: Lang -> Token -> LangContext Lang
    step lang t = foldM (union_each (rstep lang t)) empty (expect lang)

    union_each :: (S -> LangContext Lang) -> Lang -> S -> LangContext Lang
    union_each f lang' s = do
      next <- f s
      pure (union lang' next)

    rstep :: Lang -> Token -> S -> LangContext Lang
    rstep lang t s = do
      i <- introduce (derivative lang s)
      pure (prepend (left_linear_closure s `derive` t) i)
```

We also no longer shift by every symbol; instead we limit the iteration to the symbols that are actually expected.

```haskell
expect :: Lang -> [S]
expect (Lang slang _) = Set.toList (Set.fromList (fmap step (Set.toList slang)))
  where
    -- head is safe here: prepend never stores an empty remainder
    step :: ([S], LangRef) -> S
    step (xs, _) = head xs
```

The language model changes very little.

```haskell
root :: Lang
root = Lang Set.empty True

prepend :: [[S]] -> LangRef -> Lang
prepend seq lang = foldl union empty (fmap step seq)
  where
    step :: [S] -> Lang
    step [] = unref lang
    step xs = Lang (Set.singleton (xs, lang)) False

derivative :: Lang -> S -> Lang
derivative (Lang slang _) s = foldl union empty (fmap step (Set.toList slang))
  where
    step :: ([S], LangRef) -> Lang
    step ([x], lang)    | x == s = unref lang
    step ((x:xs), lang) | x == s = Lang (Set.singleton (xs, lang)) False
    step _ = empty

empty :: Lang
empty = Lang Set.empty False

union :: Lang -> Lang -> Lang
union (Lang a e1) (Lang b e2) = Lang (a <> b) (e1 || e2)
```

## Breaking down into states

Examining further, we notice that the left-linear closure can be broken down into a state machine.

```haskell
left_linear_closure_state :: S -> Int
left_linear_closure_state Elem0  = 1
left_linear_closure_state Elem1  = 2
left_linear_closure_state Elem2  = 3
left_linear_closure_state Term   = 4
left_linear_closure_state Plus   = 5
left_linear_closure_state Mul    = 6
left_linear_closure_state LeftP  = 7
left_linear_closure_state RightP = 8
```

```haskell
transitions :: Int -> S -> Maybe Int
transitions 0 _ = Nothing

transitions 1 LeftP = pure 9
-- [LeftP, Elem0, RightP],
-- [LeftP, Elem0, RightP, Mul, Elem1],
-- [LeftP, Elem0, RightP, Mul, Elem1, Plus, Elem0],
-- [LeftP, Elem0, RightP, Plus, Elem0],
transitions 1 Term = pure 10
-- [Term],
-- [Term, Mul, Elem1],
-- [Term, Mul, Elem1, Plus, Elem0],
-- [Term, Plus, Elem0]]

transitions 2 LeftP = pure 11
-- [LeftP, Elem0, RightP],
-- [LeftP, Elem0, RightP, Mul, Elem1],
transitions 2 Term = pure 12
-- [Term],
-- [Term, Mul, Elem1]]

transitions 3 LeftP = pure 13
-- [LeftP, Elem0, RightP],
transitions 3 Term = pure 0
-- [Term]]

transitions 4 Term   = pure 0
transitions 5 Plus   = pure 0
transitions 6 Mul    = pure 0
transitions 7 LeftP  = pure 0
transitions 8 RightP = pure 0

transitions 9 Elem0 = pure 14
-- [Elem0, RightP],
-- [Elem0, RightP, Mul, Elem1],
-- [Elem0, RightP, Mul, Elem1, Plus, Elem0],
-- [Elem0, RightP, Plus, Elem0],

transitions 10 Mul = pure 15
-- [],
-- [Mul, Elem1],
-- [Mul, Elem1, Plus, Elem0],
transitions 10 Plus = pure 16
-- [Plus, Elem0]]

transitions 11 Elem0 = pure 17
-- [Elem0, RightP],
-- [Elem0, RightP, Mul, Elem1],

transitions 12 Mul = pure 18
-- [],
-- [Mul, Elem1]]

transitions 13 Elem0 = pure 19
-- [Elem0, RightP],

transitions 14 RightP = pure 20
-- [RightP],
-- [RightP, Mul, Elem1],
-- [RightP, Mul, Elem1, Plus, Elem0],
-- [RightP, Plus, Elem0],

transitions 15 Elem1 = pure 21
-- [Elem1],
-- [Elem1, Plus, Elem0],

transitions 16 Elem0 = pure 0
-- [Elem0]]

transitions 17 RightP = pure 22
-- [RightP],
-- [RightP, Mul, Elem1],

transitions 18 Elem1 = pure 0
-- [Elem1]]

transitions 19 RightP = pure 0
-- [RightP],

transitions 20 Mul = pure 15
-- [],
-- [Mul, Elem1],
-- [Mul, Elem1, Plus, Elem0],
transitions 20 Plus = pure 16
-- [Plus, Elem0],

transitions 21 Plus = pure 16
-- [],
-- [Plus, Elem0],

transitions 22 Mul = pure 18
-- [],
-- [Mul, Elem1],

transitions _ _ = Nothing
```

```haskell
expectations :: Int -> [S]
expectations 0  = []
expectations 1  = [LeftP, Term]
expectations 2  = [LeftP, Term]
expectations 3  = [LeftP, Term]
expectations 4  = [Term]
expectations 5  = [Plus]
expectations 6  = [Mul]
expectations 7  = [LeftP]
expectations 8  = [RightP]
expectations 9  = [Elem0]
expectations 10 = [Mul, Plus]
expectations 11 = [Elem0]
expectations 12 = [Mul]
expectations 13 = [Elem0]
expectations 14 = [RightP]
expectations 15 = [Elem1]
expectations 16 = [Elem0]
expectations 17 = [RightP]
expectations 18 = [Elem1]
expectations 19 = [RightP]
expectations 20 = [Mul, Plus]
expectations 21 = [Plus]
expectations 22 = [Mul]
```

```haskell
terminations :: Int -> Bool
terminations 0  = True
terminations 10 = True
terminations 12 = True
terminations 20 = True
terminations 21 = True
terminations 22 = True
terminations _  = False

start :: Int
start = 16
```

The derivation step becomes a really simple transition through the state machine.

```haskell
derive :: Int -> S -> Maybe Int
derive xs s = transitions xs s
```

The language model flips from a set of words to a set of states.

```haskell
data Lang = Lang {unlang :: Set.Set (Int, LangRef), epsilon :: Bool}
  deriving (Show)
```

In the recognizer, we upgrade `rstep`. It now introduces the derivative only if the derivation succeeds.

```haskell
rstep lang t s =
  case left_linear_closure_state s `derive` t of
    Nothing -> pure empty
    Just xs -> do
      i <- introduce (derivative lang s)
      pure (prepend xs i)
```

The language model is changed to operate on state transitions.

```haskell
expect :: Lang -> [S]
expect (Lang slang _) = Set.toList (Set.fromList (join (fmap step (Set.toList slang))))
  where
    step :: (Int, LangRef) -> [S]
    step (xs, _) = expectations xs

root :: Lang
root = Lang Set.empty True

prepend :: Int -> LangRef -> Lang
prepend xs lang =
  if terminations xs
    then if length (expectations xs) == 0
           then unref lang
           else Lang (Set.singleton (xs, lang)) False `union` unref lang
    else Lang (Set.singleton (xs, lang)) False

derivative :: Lang -> S -> Lang
derivative (Lang slang _) s = foldl union empty (fmap step (Set.toList slang))
  where
    step :: (Int, LangRef) -> Lang
    step (xs, lang) = case transitions xs s of
      Just ys -> prepend ys lang
      Nothing -> empty

empty :: Lang
empty = Lang Set.empty False

union :: Lang -> Lang -> Lang
union (Lang a e1) (Lang b e2) = Lang (a <> b) (e1 || e2)
```

That's it for the recognizer, for now.

## Mid-conclusion

The recognizer consists of roughly 70 lines of Haskell.

This algorithm is conceptually simple and efficient, except for one problem: what should we do with left-recursive grammars?

## Left recursion on left linear closure

Let's take a different grammar, one that's left-recursive.

```
E -> P
E -> E Plus P
P -> R
P -> P Mul R
R -> Term
R -> LeftP E RightP
```

I'll illustrate how to resolve recursive rules when constructing the left-linear closure. This isn't in the paper; I made it up by examining and thinking about the problem a bit.

First, number the rules and flip the left-hand-side symbol into a suffix.

```
P {0 E}
E Plus P {1 E}
R {2 P}
P Mul R {3 P}
Term {4 R}
LeftP E RightP {5 R}
```

Then select one rule and derive it once. We mark which rule was pasted in with a suffix.

```
P {0 E}
───────────────────
R {2 P} {0 E}
P Mul R {3 P} {0 E}
```

Notice that rules `1` and `0:3` here are recursive, and we know it because we just marked it:

```
P Mul R {3 P} {0 E}
E Plus P {1 E}
```

We put these rules aside and continue on the remaining rule.

```
R {2 P} {0 E}
────────────────────────────────
Term {4 R} {2 P} {0 E}
LeftP E RightP {5 R} {2 P} {0 E}
```

Next we form regular expressions by pasting the set-aside rules back in. How do we do that? Observe that the rule `E Plus P {1 E}` is recursive with the top level, so we drop it there:

```
Term {4 R} {2 P} {0 E} (Plus P {1 E})*
LeftP E RightP {5 R} {2 P} {0 E} (Plus P {1 E})*
```

The second recursive rule, `P Mul R {3 P} {0 E}`, is recursive on the second level, so we clip it and paste it before the `{0 E}` subsequence. If you had a longer subsequence, you'd clip it up appropriately and paste it before the occurrence of the whole subsequence.

```
Term {4 R} {2 P} (Mul R {3 P})* {0 E} (Plus P {1 E})*
LeftP E RightP {5 R} {2 P} (Mul R {3 P})* {0 E} (Plus P {1 E})*
```

Repeating these steps for the remaining nonterminals, we get:

```
Term {4 R} {2 P} (Mul R {3 P})* {0 E} (Plus P {1 E})*
LeftP E RightP {5 R} {2 P} (Mul R {3 P})* {0 E} (Plus P {1 E})*
Term {4 R} {2 P} (Mul R {3 P})*
LeftP E RightP {5 R} {2 P} (Mul R {3 P})*
Term {4 R}
LeftP E RightP {5 R}
```

You can do this transform in any order and it ought to produce the same result.

## Language edit distance

Another interesting case is language edit distance, i.e. an error-correcting parser. We can use an old trick to formulate it:

> Aho, A. V., & Peterson, T. G. (1972). A Minimum Distance Error-Correcting Parser for Context-Free Languages. SIAM Journal on Computing, 1(4), 305–312. doi:10.1137/0201022

We create a grammar with corrections that covers the existing grammar:

- Replace every terminal `e` by a nonterminal `E(e)`.
- Replace the start symbol `start` by `start'` and add the following productions:

```
start' → start
start' → start H
H → H I
H → I
```

- For every terminal `a` in the grammar, add the following:

```
E(a) → a
E(a) → b        (for every other terminal b)
E(a) → H a
E(a) → ε
I → a
```

The covering grammar matches every input: either by ignoring extra tokens (the `I` rules), replacing them (the `E(a) → b` rules), or inserting missing ones (the empty rules).
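The construction is mechanical, so it can be generated. A sketch over the running example's terminal alphabet, with symbols encoded as plain strings (the encoding and the `Rule`/`covering` names are mine):

```haskell
type Rule = (String, [String])

terminals :: [String]
terminals = ["Term", "Plus", "Mul", "LeftP", "RightP"]

-- the Aho-Peterson covering productions; an empty right-hand side
-- stands for the empty rule (ε)
covering :: [Rule]
covering =
  [ ("start'", ["start"])
  , ("start'", ["start", "H"])
  , ("H", ["H", "I"])
  , ("H", ["I"]) ]
  ++ concat
     [ [ ("E(" ++ a ++ ")", [a]) ]                              -- exact match
       ++ [ ("E(" ++ a ++ ")", [b]) | b <- terminals, b /= a ]  -- replacement
       ++ [ ("E(" ++ a ++ ")", ["H", a])                        -- skip extra tokens before a
          , ("E(" ++ a ++ ")", [])                              -- a missing (empty rule)
          , ("I", [a]) ]                                        -- any token is ignorable
     | a <- terminals ]
```

For our five terminals this already yields 44 productions, which hints at why the grammar explodes once emptiable rules enter the picture.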

We can avoid some complexity here by modifying the recognizer. First, sample the next set from the previous set; this replaces all the `H` rules. Next, shift with every terminal symbol; this replaces the replacement rules. Finally, what we are left with is the empty rule for every terminal. It can be replaced by an adjustment in the preprocessing phase.

Adjusting for empty rules, however, the grammar explodes in size. I wrote a test program to construct a left-linear closure with emptiable rules: we get 186 rules for our simple grammar, and I'm not sure I even did it correctly.

## Conclusion

If you have a suitable grammar, this is an incredibly simple algorithm. Emptiable rules and left-recursive grammars add complexity, though.

Overall I like what I've seen so far, but I find the paper challenging to understand. There's enough material here for an article series. I think the next step is to look closer at emptiable rules, and then at the parsing part.