Relational Parsing, part 0, recognizer

Examining relational parsing, a new general context-free parsing algorithm.

I will write a recognizer in Haskell and iterate through the design. I'm not certain I got it correct, so please examine the paper for details. The authors have done a good job explaining it, but I still had to guess some details.

Throughout the session, I will use the same grammar described here.

Elem0 -> Elem1
Elem0 -> Elem1 Plus Elem0
Elem1 -> Elem2
Elem1 -> Elem2 Mul Elem1
Elem2 -> Term
Elem2 -> LeftP Elem0 RightP

IMPORTANT! Left-recursive grammars require extra attention.

Our final module is going to use sets, maps and state.

module Recognizer where

import qualified Data.Set as Set
import qualified Data.Map as Map
import Control.Monad.State.Lazy

Symbols need to be described in Haskell:

data S = Elem0 | Elem1 | Elem2 | Term | Plus | Mul | LeftP | RightP
    deriving (Show, Eq, Ord)
type Token = S

The key concept of the algorithm is the left linear closure. It consists of the derivations that are left-linear: every word in it starts with a terminal symbol. We can define it for our grammar as follows:

left_linear_closure :: S -> [[S]]
left_linear_closure Elem0 = [
    [Term],
    [LeftP, Elem0, RightP],
    [Term, Mul, Elem1],
    [LeftP, Elem0, RightP, Mul, Elem1],
    [Term, Plus, Elem0],
    [LeftP, Elem0, RightP, Plus, Elem0],
    [Term, Mul, Elem1, Plus, Elem0],
    [LeftP, Elem0, RightP, Mul, Elem1, Plus, Elem0]]
left_linear_closure Elem1 = [
    [Term],
    [LeftP, Elem0, RightP],
    [Term, Mul, Elem1],
    [LeftP, Elem0, RightP, Mul, Elem1]]
left_linear_closure Elem2 = [
    [Term],
    [LeftP, Elem0, RightP]]
left_linear_closure Term = [[Term]]
left_linear_closure Plus = [[Plus]]
left_linear_closure Mul = [[Mul]]
left_linear_closure LeftP = [[LeftP]]
left_linear_closure RightP = [[RightP]]
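The hand-written tables above can also be computed mechanically. Below is a sketch of one way to do it (not from the paper): keep expanding the leftmost symbol of each candidate word until it is a terminal. This only terminates if the grammar has no left recursion, which foreshadows the trouble discussed later.

```haskell
data S = Elem0 | Elem1 | Elem2 | Term | Plus | Mul | LeftP | RightP
    deriving (Show, Eq, Ord)

-- The grammar rules, transcribed from the start of the article.
rules :: S -> [[S]]
rules Elem0 = [[Elem1], [Elem1, Plus, Elem0]]
rules Elem1 = [[Elem2], [Elem2, Mul, Elem1]]
rules Elem2 = [[Term], [LeftP, Elem0, RightP]]
rules _     = []  -- terminals have no rules

terminal :: S -> Bool
terminal = null . rules

-- Expand the leftmost symbol until every word starts with a terminal.
-- This loops forever on a left-recursive grammar.
llc :: [S] -> [[S]]
llc [] = []
llc w@(x : rest)
    | terminal x = [w]
    | otherwise  = concatMap (\body -> llc (body <> rest)) (rules x)

left_linear_closure :: S -> [[S]]
left_linear_closure s = llc [s]
```

For our grammar this reproduces the tables above; for example, left_linear_closure Elem2 yields [[Term],[LeftP,Elem0,RightP]].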

The paper combines the left-linear closure with a derivation step. It takes a set of words in a language and produces another language: the words that start with the given symbol, with that symbol chopped off from the front.

derive :: [[S]] -> S -> [[S]]
derive xs s = foldl step [] xs
    where step :: [[S]] -> [S] -> [[S]]
          step xxs (x:xs) | (x == s) = (xs:xxs)
          step xxs xs = xxs
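For example, deriving Elem2's closure by LeftP keeps only the word that starts with LeftP and chops the LeftP off. Here's a self-contained check using a lightly restated copy of derive (note that foldl builds the result list in reverse order, which doesn't matter for a set of words):

```haskell
data S = Elem0 | Elem1 | Elem2 | Term | Plus | Mul | LeftP | RightP
    deriving (Show, Eq, Ord)

-- Same derivation step as above, with renamed variables.
derive :: [[S]] -> S -> [[S]]
derive ws s = foldl step [] ws
    where step acc (x : rest) | x == s = rest : acc
          step acc _ = acc

demo :: [[S]]
demo = derive [[Term], [LeftP, Elem0, RightP]] LeftP
-- demo == [[Elem0, RightP]]
```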

We define a starting language; this is the state after parsing the empty string.

start :: [[S]]
start = [[Elem0]]

And we need to list the symbols, because we iterate through them in each parsing step. (Elem2 is omitted: it never appears at the front of any word our languages contain.)

symbols :: [S]
symbols = [Elem0, Elem1, Term, Plus, Mul, LeftP, RightP]

For a start, we're going to use an inefficient language model: our language is going to be a set of words, just to show off the principle by which this algorithm works.

data Lang = Lang {unlang :: [[S]]} deriving (Show)

The recognizer algorithm is exactly as in the paper.

recognize :: [Token] -> Bool
recognize ts = epsilon (foldl step init ts)
    where init :: Lang
          init = prepend start root
          step :: Lang -> Token -> Lang
          step lang t = foldl (union_each (rstep lang t)) empty symbols
          union_each :: (S -> Lang) -> Lang -> S -> Lang
          union_each f lang' s = union lang' (f s)
          rstep :: Lang -> Token -> S -> Lang
          rstep lang t s = prepend
                           (left_linear_closure s `derive` t)
                           (derivative lang s)

The rstep captures the essence of the algorithm. For each expected symbol, we chop it off the front of the language and replace it with its left-linear closure, derived by the incoming terminal symbol.

A naive implementation of the language model follows:

root :: Lang
root = Lang [[]]

prepend :: [[S]] -> Lang -> Lang
prepend l (Lang lang) = Lang $ do h <- l
                                  t <- lang
                                  pure (h <> t)

derivative :: Lang -> S -> Lang
derivative (Lang lang) s = Lang (derive lang s)

empty :: Lang
empty = Lang []

union :: Lang -> Lang -> Lang
union (Lang a) (Lang b) = Lang (a <> b)

epsilon :: Lang -> Bool
epsilon (Lang a) = any (==[]) a
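To see this model in action, here's a small self-contained check (restating prepend with a list comprehension, which is equivalent to the do-block above): prepending onto the root language, which contains only the empty word, yields exactly the prepended words.

```haskell
data S = Elem0 | Term | Plus | Mul | LeftP | RightP
    deriving (Show, Eq, Ord)

newtype Lang = Lang { unlang :: [[S]] } deriving (Show)

prepend :: [[S]] -> Lang -> Lang
prepend l (Lang lang) = Lang [h <> t | h <- l, t <- lang]

epsilon :: Lang -> Bool
epsilon (Lang a) = any null a

-- The root language contains only the empty word.
root :: Lang
root = Lang [[]]

demo :: [[S]]
demo = unlang (prepend [[Term], [LeftP, Elem0, RightP]] root)
-- demo == [[Term], [LeftP, Elem0, RightP]]
```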

Finally, let's test that it recognizes strings from our language.

test :: Bool
test = recognize [LeftP, Term, Plus, Term, RightP]  -- True

This concludes the outline of the recognizer algorithm. Next, we're going to refine it.

Splitting the language model

If we look at the simple language model, the most expensive part is the prepend step.

We shall approach the shape described in the paper.

data Lang = Lang {unlang :: Set.Set ([S], Lang), epsilon :: Bool} deriving (Show, Ord, Eq)

The epsilon field describes whether the language contains the empty word; if it does after the input is consumed, the string is recognized. The recognizer stays intact, but our language model changes.

root :: Lang
root = Lang (Set.empty) True

prepend :: [[S]] -> Lang -> Lang
prepend seq lang = foldl union empty (fmap step seq)
    where step :: [S] -> Lang
          step [] = lang
          step xs = Lang (Set.singleton (xs, lang)) False

derivative :: Lang -> S -> Lang
derivative (Lang slang _) s = foldl union empty (Set.map step slang)
    where step :: ([S], Lang) -> Lang
          step ([x], lang) | (x == s) = lang
          step ((x:xs), lang) | (x == s) = Lang (Set.singleton (xs, lang)) False
          step (xs, lang) = empty

empty :: Lang
empty = Lang Set.empty False

union :: Lang -> Lang -> Lang
union (Lang a e1) (Lang b e2) = Lang (a <> b) (e1 || e2)
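The subtle part of this derivative is the first clause: when the last symbol of a word is consumed, the tail language is spliced in directly, which is how epsilon eventually becomes visible. A minimal self-contained check, with a trimmed-down symbol set:

```haskell
import qualified Data.Set as Set

data S = Term | Plus deriving (Show, Eq, Ord)

data Lang = Lang { unlang :: Set.Set ([S], Lang), epsilon :: Bool }
    deriving (Show, Eq, Ord)

root, empty :: Lang
root  = Lang Set.empty True
empty = Lang Set.empty False

union :: Lang -> Lang -> Lang
union (Lang a e1) (Lang b e2) = Lang (a <> b) (e1 || e2)

derivative :: Lang -> S -> Lang
derivative (Lang slang _) s = foldl union empty (Set.map step slang)
    where step ([x], lang)      | x == s = lang
          step ((x : xs), lang) | x == s = Lang (Set.singleton (xs, lang)) False
          step _ = empty

-- The language containing the single word [Term] over root:
-- consuming Term splices root back in, flipping epsilon to True.
demo :: Bool
demo = epsilon (derivative (Lang (Set.singleton ([Term], root)) False) Term)
-- demo == True
```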

Inserting reference numbers

The model forms a DAG, where a lot of structure is shared between languages. Also, once structures are created, they do not change. We'd like to exploit this by numbering our language models. For that, we're going to create labeled references.

data LangRef = LangRef {langid :: Int, unref :: Lang} deriving (Show)

instance Eq LangRef where
    (==) a b = (langid a == langid b)
instance Ord LangRef where
    (<=) a b = (langid a <= langid b)

data Lang = Lang {unlang :: Set.Set ([S], LangRef), epsilon :: Bool} deriving (Show)

type LangContext a = State Int a

introduce :: Lang -> LangContext LangRef
introduce lang = do
    i <- get
    put (i + 1)
    pure (LangRef i lang)
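A quick self-contained check of the numbering scheme, using a placeholder Lang type so the snippet stands alone: each introduce call hands out the next integer.

```haskell
import Control.Monad.State.Lazy

-- Placeholder language type, just to keep this sketch self-contained.
data Lang = Lang deriving (Show)

data LangRef = LangRef { langid :: Int, unref :: Lang } deriving (Show)

type LangContext a = State Int a

introduce :: Lang -> LangContext LangRef
introduce lang = do
    i <- get
    put (i + 1)
    pure (LangRef i lang)

-- Running two introductions from counter 0 yields ids 0 and 1.
demo :: (Int, Int)
demo = evalState (do a <- introduce Lang
                     b <- introduce Lang
                     pure (langid a, langid b)) 0
-- demo == (0, 1)
```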

The recognizer needs to be changed to introduce the state for numbering references.

recognize :: [Token] -> Bool
recognize ts = epsilon (evalState (recognizeI ts) 0)

recognizeI :: [Token] -> LangContext Lang
recognizeI ts = do k <- introduce root
                   foldM step (prepend start k) ts
    where step :: Lang -> Token -> LangContext Lang
          step lang t = foldM (union_each (rstep lang t)) empty (expect lang)
          union_each :: (S -> LangContext Lang) -> Lang -> S -> LangContext Lang
          union_each f lang' s = do
            next <- f s
            pure (union lang' next)
          rstep :: Lang -> Token -> S -> LangContext Lang
          rstep lang t s = do
            i <- introduce (derivative lang s)
            pure (prepend (left_linear_closure s `derive` t) i)

We also no longer "shift" by every symbol; instead, we limit it to the symbols that are being expected.

expect :: Lang -> [S]
expect (Lang slang _) = Set.toList (Set.fromList (fmap step (Set.toList slang)))
    where step :: ([S], LangRef) -> S
          step (xs, _) = head xs  -- safe: prepend never stores an empty word

The language model changes very little.

root :: Lang
root = Lang Set.empty True

prepend :: [[S]] -> LangRef -> Lang
prepend seq lang = foldl union empty (fmap step seq)
    where step :: [S] -> Lang
          step [] = unref lang
          step xs = Lang (Set.singleton (xs, lang)) False

derivative :: Lang -> S -> Lang
derivative (Lang slang _) s = foldl union empty (fmap step (Set.toList slang))
    where step :: ([S], LangRef) -> Lang
          step ([x], lang) | (x == s) = unref lang
          step ((x:xs), lang) | (x == s) = Lang (Set.singleton (xs, lang)) False
          step (xs, lang) = empty

empty :: Lang
empty = Lang Set.empty False

union :: Lang -> Lang -> Lang
union (Lang a e1) (Lang b e2) = Lang (a <> b) (e1 || e2)

Breaking down into states

Examining further, we notice that the left-linear closure can be broken down into a state machine.

left_linear_closure_state :: S -> Int
left_linear_closure_state Elem0 = 1
left_linear_closure_state Elem1 = 2
left_linear_closure_state Elem2 = 3
left_linear_closure_state Term = 4
left_linear_closure_state Plus = 5
left_linear_closure_state Mul = 6
left_linear_closure_state LeftP = 7
left_linear_closure_state RightP = 8

transitions :: Int -> S -> Maybe Int
transitions 0 _ = Nothing
transitions 1 LeftP = pure 9
--    [LeftP, Elem0, RightP],
--    [LeftP, Elem0, RightP, Mul, Elem1],
--    [LeftP, Elem0, RightP, Mul, Elem1, Plus, Elem0],
--    [LeftP, Elem0, RightP, Plus, Elem0],
transitions 1 Term = pure 10
--    [Term],
--    [Term, Mul, Elem1],
--    [Term, Mul, Elem1, Plus, Elem0],
--    [Term, Plus, Elem0]]
transitions 2 LeftP = pure 11
--    [LeftP, Elem0, RightP],
--    [LeftP, Elem0, RightP, Mul, Elem1],
transitions 2 Term = pure 12
--    [Term],
--    [Term, Mul, Elem1]]
transitions 3 LeftP = pure 13
--    [LeftP, Elem0, RightP],
transitions 3 Term = pure 0
--    [Term]]
transitions 4 Term = pure 0
transitions 5 Plus = pure 0
transitions 6 Mul = pure 0
transitions 7 LeftP = pure 0
transitions 8 RightP = pure 0
transitions 9 Elem0 = pure 14
--    [Elem0, RightP],
--    [Elem0, RightP, Mul, Elem1],
--    [Elem0, RightP, Mul, Elem1, Plus, Elem0],
--    [Elem0, RightP, Plus, Elem0],
transitions 10 Mul = pure 15
--    [],
--    [Mul, Elem1],
--    [Mul, Elem1, Plus, Elem0],
transitions 10 Plus = pure 16
--    [Plus, Elem0]]
transitions 11 Elem0 = pure 17
--    [Elem0, RightP],
--    [Elem0, RightP, Mul, Elem1],
transitions 12 Mul = pure 18
--    [],
--    [Mul, Elem1]]
transitions 13 Elem0 = pure 19
--    [Elem0, RightP],
transitions 14 RightP = pure 20
--    [RightP],
--    [RightP, Mul, Elem1],
--    [RightP, Mul, Elem1, Plus, Elem0],
--    [RightP, Plus, Elem0],
transitions 15 Elem1 = pure 21
--    [Elem1],
--    [Elem1, Plus, Elem0],
transitions 16 Elem0 = pure 0
--    [Elem0]]
transitions 17 RightP = pure 22
--    [RightP],
--    [RightP, Mul, Elem1],
transitions 18 Elem1 = pure 0
--    [Elem1]]
transitions 19 RightP = pure 0
--    [RightP],
transitions 20 Mul = pure 15
--    [],
--    [Mul, Elem1],
--    [Mul, Elem1, Plus, Elem0],
transitions 20 Plus = pure 16
--    [Plus, Elem0],
transitions 21 Plus = pure 16
--    [],
--    [Plus, Elem0],
transitions 22 Mul = pure 18
--    [],
--    [Mul, Elem1],
transitions _ _ = Nothing

expectations :: Int -> [S]
expectations 0 = []
expectations 1 = [LeftP, Term]
expectations 2 = [LeftP, Term]
expectations 3 = [LeftP, Term]
expectations 4 = [Term]
expectations 5 = [Plus]
expectations 6 = [Mul]
expectations 7 = [LeftP]
expectations 8 = [RightP]
expectations 9 = [Elem0]
expectations 10 = [Mul,Plus]
expectations 11 = [Elem0]
expectations 12 = [Mul]
expectations 13 = [Elem0]
expectations 14 = [RightP]
expectations 15 = [Elem1]
expectations 16 = [Elem0]
expectations 17 = [RightP]
expectations 18 = [Elem1]
expectations 19 = [RightP]
expectations 20 = [Mul,Plus]
expectations 21 = [Plus]
expectations 22 = [Mul]

terminations :: Int -> Bool
terminations 0 = True
terminations 10 = True
terminations 12 = True
terminations 20 = True
terminations 21 = True
terminations 22 = True
terminations _ = False

start :: Int
start = 16  -- state 16 represents the single word [Elem0], our previous starting language

Derivation step becomes a really simple transition through the state machine.

derive :: Int -> S -> Maybe Int
derive = transitions
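To see the machine in motion, here's a self-contained trace using just an excerpt of the full transition table above (states 1, 9 and 14, the path Elem0's closure takes along the word LeftP, Elem0, RightP):

```haskell
import Control.Monad (foldM)

data S = Elem0 | Elem1 | Term | Plus | Mul | LeftP | RightP
    deriving (Show, Eq, Ord)

-- Excerpt of the transition table: only the states visited on this path.
transitions :: Int -> S -> Maybe Int
transitions 1  LeftP  = Just 9
transitions 9  Elem0  = Just 14
transitions 14 RightP = Just 20
transitions _  _      = Nothing

terminations :: Int -> Bool
terminations 20 = True
terminations _  = False

-- Thread a word through the machine; Nothing means the word
-- fell outside the closure.
trace :: Int -> [S] -> Maybe Int
trace = foldM transitions

-- trace 1 [LeftP, Elem0, RightP] == Just 20, a terminating state.
```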

The language model flips from set of words to states.

data Lang = Lang {unlang :: Set.Set (Int, LangRef), epsilon :: Bool} deriving (Show)

In the recognizer, we upgrade rstep: now it introduces the derivative only if the closure actually derives something.

      rstep lang t s = do
        case left_linear_closure_state s `derive` t of
            Nothing -> pure empty
            Just st -> do
                i <- introduce (derivative lang s)
                pure (prepend st i)

The language model is changed to operate on state transitions.

expect :: Lang -> [S]
expect (Lang slang _) = Set.toList (Set.fromList (join (fmap step (Set.toList slang))))
    where step :: (Int, LangRef) -> [S]
          step (st, _) = expectations st

root :: Lang
root = Lang Set.empty True

prepend :: Int -> LangRef -> Lang
prepend st lang
    | terminations st && null (expectations st) = unref lang
    | terminations st = Lang (Set.singleton (st, lang)) False `union` unref lang
    | otherwise       = Lang (Set.singleton (st, lang)) False

derivative :: Lang -> S -> Lang
derivative (Lang slang _) s = foldl union empty (fmap step (Set.toList slang))
    where step :: (Int, LangRef) -> Lang
          step (st, lang) = case transitions st s of
            Just next -> prepend next lang
            Nothing -> empty

empty :: Lang
empty = Lang Set.empty False

union :: Lang -> Lang -> Lang
union (Lang a e1) (Lang b e2) = Lang (a <> b) (e1 || e2)

That's it for the recognizer, for now.

Mid-conclusion

The whole recognizer consists of roughly 70 lines of Haskell.

This algorithm is conceptually simple and efficient, except for one problem: what should we do with left-recursive grammars?

Left recursion on left linear closure

Let's take a different grammar, one that's left-recursive.

E -> P
E -> E Plus P
P -> R
P -> P Mul R
R -> Term
R -> LeftP E RightP

I illustrate how to resolve recursive rules when constructing the left linear closure. This isn't in the paper; I made it up by examining and thinking about the problem a bit.

First, number the rules and flip the LHS symbol into a suffix annotation.

P {0 E}
E Plus P {1 E}
R {2 P}
P Mul R {3 P}
Term {4 R}
LeftP E RightP {5 R}

Then select one rule and derive it once. We mark which rule was pasted in by its suffix.

P {0 E}
───────────────────
R {2 P} {0 E}
P Mul R {3 P} {0 E}

Notice that rules 1 and 0:3 here are recursive, and we know it because we just marked it.

P Mul R {3 P} {0 E}
E Plus P {1 E}

We put these rules aside and continue on the remaining rule.

R {2 P} {0 E}
────────────────────────────────
Term {4 R} {2 P} {0 E}
LeftP E RightP {5 R} {2 P} {0 E}

Next we form regular expressions by pasting the remaining rules in. How do we do that? Well, observe that the rule E Plus P {1 E} is recursive at the top level: we drop its leading E and paste the rest in as a repetition there.

Term {4 R} {2 P} {0 E} (Plus P {1 E})*
LeftP E RightP {5 R} {2 P} {0 E} (Plus P {1 E})*

The second recursive rule, P Mul R {3 P} {0 E}, is recursive on the second level. We clip off its leading P and its {0 E} suffix, and paste the remainder in as a repetition before the {0 E} subsequence. If you had a longer suffix, you'd clip it off appropriately and paste the repetition before the occurrence of the whole subsequence.

Term {4 R} {2 P} (Mul R {3 P})* {0 E} (Plus P {1 E})*
LeftP E RightP {5 R} {2 P} (Mul R {3 P})* {0 E} (Plus P {1 E})*

Repeating these steps, we get:

Term {4 R} {2 P} (Mul R {3 P})* {0 E} (Plus P {1 E})*
LeftP E RightP {5 R} {2 P} (Mul R {3 P})* {0 E} (Plus P {1 E})*
Term {4 R} {2 P} (Mul R {3 P})*
LeftP E RightP {5 R} {2 P} (Mul R {3 P})*
Term {4 R}
LeftP E RightP {5 R}

You can do this transform in any order and it ought to produce the same result.
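The result can be transcribed into a small regular-expression-like datatype. A sketch, with the rule-number annotations dropped and Star marking a repeated subsequence (the Rx type is my own invention, not from the paper):

```haskell
data S = E | P | R | Term | Plus | Mul | LeftP | RightP
    deriving (Show, Eq)

data Rx = Sym S | Star [S] deriving (Show, Eq)

-- The left-linear closure of E, transcribed from the first two rows above:
-- Term (Mul R)* (Plus P)*  and  LeftP E RightP (Mul R)* (Plus P)*.
closureE :: [[Rx]]
closureE =
    [ [Sym Term, Star [Mul, R], Star [Plus, P]]
    , [Sym LeftP, Sym E, Sym RightP, Star [Mul, R], Star [Plus, P]]
    ]
```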

Language edit distance

Another interesting case is the language edit distance / error-correcting parser. We can use an old trick to formulate it.

Aho, A. V., & Peterson, T. G. (1972). A Minimum Distance Error-Correcting Parser for Context-Free Languages. SIAM Journal on Computing, 1(4), 305–312. doi:10.1137/0201022

We create a grammar with correction rules that covers the existing grammar. The covering grammar matches every input: either by ignoring a symbol (the I rules), replacing it (the E(a) → b rules), or inserting it (the empty rules).

We can avoid some complexity here by modifying the recognizer. First, sample the next set by the previous set; this replaces all H rules.

Next, shift with every terminal symbol; this replaces the replacement rules.

Finally what we are left with is the empty rule for every terminal. It can be replaced by an adjustment in the preprocessing phase.

When adjusting for empty rules, the grammar explodes in size. I wrote a test program to construct a left-linear closure with emptiable rules: we get 186 rules for our simple grammar, and I think I didn't do it correctly.

Conclusion

If you have a suitable grammar, then it's an incredibly simple algorithm. Emptiable rules and left-recursive grammars add complexity, though.

Overall I like what I've seen so far, but I find the paper challenging to understand. There's enough material here to continue into an article series. I think the next step would be to look closer at emptiable rules, then look into the parsing part.