Relational Parsing, part 0, recognizer
Examining relational parsing, a new general context-free parsing algorithm.
I will write a recognizer in Haskell and iterate through the design. I'm not certain I got it all correct, so please check the paper for details. They've done a good job explaining it, but I still had to guess some details.
Throughout the session, I will use the same grammar described here.
```
Elem0 -> Elem1
Elem0 -> Elem1 Plus Elem0
Elem1 -> Elem2
Elem1 -> Elem2 Mul Elem1
Elem2 -> Term
Elem2 -> LeftP Elem0 RightP
```
IMPORTANT! Left-recursive grammars require extra attention.
Our final module is going to use sets, maps and state.
```haskell
module Recognizer where

import qualified Data.Set as Set
import qualified Data.Map as Map
import Control.Monad.State.Lazy
```
Symbols need to be described in Haskell:
```haskell
data S = Elem0 | Elem1 | Elem2 | Term | Plus | Mul | LeftP | RightP
    deriving (Show, Eq, Ord)

type Token = S
```
The key concept of the algorithm is the left-linear closure. It consists of left-linear derivations: every word in it starts with a terminal symbol. We can define it for our grammar as follows:
```haskell
left_linear_closure :: S -> [[S]]
left_linear_closure Elem0 = [
    [Term],
    [LeftP, Elem0, RightP],
    [Term, Mul, Elem1],
    [LeftP, Elem0, RightP, Mul, Elem1],
    [Term, Plus, Elem0],
    [LeftP, Elem0, RightP, Plus, Elem0],
    [Term, Mul, Elem1, Plus, Elem0],
    [LeftP, Elem0, RightP, Mul, Elem1, Plus, Elem0]]
left_linear_closure Elem1 = [
    [Term],
    [LeftP, Elem0, RightP],
    [Term, Mul, Elem1],
    [LeftP, Elem0, RightP, Mul, Elem1]]
left_linear_closure Elem2 = [
    [Term],
    [LeftP, Elem0, RightP]]
left_linear_closure Term   = [[Term]]
left_linear_closure Plus   = [[Plus]]
left_linear_closure Mul    = [[Mul]]
left_linear_closure LeftP  = [[LeftP]]
left_linear_closure RightP = [[RightP]]
```
The paper combines the left-linear closure with a derivation step. It takes a set of words in a language and produces another language, with the given symbol chopped off from the front of each word.
```haskell
derive :: [[S]] -> S -> [[S]]
derive xs s = foldl step [] xs
    where
    step :: [[S]] -> [S] -> [[S]]
    step xxs (x:xs) | (x == s) = (xs:xxs)
    step xxs xs = xxs
```
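As a quick sanity check, here is derive run standalone (the symbol type and the function are repeated so the snippet loads on its own): deriving by Term keeps exactly the words that start with Term and chops that symbol off.

```haskell
-- Repeated from the article so this snippet compiles standalone.
data S = Elem0 | Elem1 | Elem2 | Term | Plus | Mul | LeftP | RightP
    deriving (Show, Eq, Ord)

derive :: [[S]] -> S -> [[S]]
derive xs s = foldl step [] xs
    where
    step xxs (x:rest) | x == s = rest : xxs
    step xxs _ = xxs

-- Only [Term, Plus, Elem0] starts with Term, so the result
-- is the single word [Plus, Elem0].
example :: [[S]]
example = [[Term, Plus, Elem0], [LeftP, Elem0, RightP]] `derive` Term
```

Note that when several words match, they come out in reverse order of the input, because step conses matches onto an accumulator; for a set-like language this doesn't matter.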
We define a starting language; this is the state of having parsed the empty string.
```haskell
start :: [[S]]
start = [[Elem0]]
```
And we need to list the symbols because we're iterating through them in each parsing step:
```haskell
-- Elem2 is absent: it never appears inside any closure word.
symbols :: [S]
symbols = [Elem0, Elem1, Term, Plus, Mul, LeftP, RightP]
```
To start with, we're going to use an inefficient language model: our language is simply a set of words. This is just to show off the principle by which the algorithm works.
```haskell
data Lang = Lang {unlang :: [[S]]} deriving (Show)
```

The recognizer algorithm is exactly as it is in the paper.
```haskell
recognize :: [Token] -> Bool
recognize ts = epsilon (foldl step init ts)
    where
    init :: Lang
    init = prepend start root
    step :: Lang -> Token -> Lang
    step lang t = foldl (union_each (rstep lang t)) empty symbols
    union_each :: (S -> Lang) -> Lang -> S -> Lang
    union_each f lang' s = union lang' (f s)
    rstep :: Lang -> Token -> S -> Lang
    rstep lang t s = prepend
        (left_linear_closure s `derive` t)
        (derivative lang s)
```
The 'rstep' captures the essence of the algorithm: for each symbol, we chop a nonterminal/terminal off the front and replace it with its left-linear closure, derived by the given terminal symbol.
A stub implementation of the language model follows:
```haskell
root :: Lang
root = Lang [[]]

prepend :: [[S]] -> Lang -> Lang
prepend l (Lang lang) = Lang $ do
    h <- l
    t <- lang
    pure (h <> t)

derivative :: Lang -> S -> Lang
derivative (Lang lang) s = Lang (derive lang s)

empty :: Lang
empty = Lang []

union :: Lang -> Lang -> Lang
union (Lang a) (Lang b) = Lang (a <> b)

epsilon :: Lang -> Bool
epsilon (Lang a) = any (==[]) a
```
Finally, let's test that it recognizes strings from our language.
```haskell
test = recognize [LeftP, Term, Plus, Term, RightP]
```

This concludes the outline of the recognizer algorithm. Next we're going to refine it.
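For convenience, here is the simple-model recognizer gathered into one self-contained listing (lightly compressed: the terminal cases of the closure are collapsed into a single catch-all equation), plus a couple of extra inputs worth trying.

```haskell
data S = Elem0 | Elem1 | Elem2 | Term | Plus | Mul | LeftP | RightP
    deriving (Show, Eq, Ord)

type Token = S

left_linear_closure :: S -> [[S]]
left_linear_closure Elem0 =
    [[Term], [LeftP, Elem0, RightP],
     [Term, Mul, Elem1], [LeftP, Elem0, RightP, Mul, Elem1],
     [Term, Plus, Elem0], [LeftP, Elem0, RightP, Plus, Elem0],
     [Term, Mul, Elem1, Plus, Elem0],
     [LeftP, Elem0, RightP, Mul, Elem1, Plus, Elem0]]
left_linear_closure Elem1 =
    [[Term], [LeftP, Elem0, RightP],
     [Term, Mul, Elem1], [LeftP, Elem0, RightP, Mul, Elem1]]
left_linear_closure Elem2 = [[Term], [LeftP, Elem0, RightP]]
left_linear_closure s = [[s]]  -- terminals close over themselves

derive :: [[S]] -> S -> [[S]]
derive xs s = foldl step [] xs
    where
    step xxs (x:rest) | x == s = rest : xxs
    step xxs _ = xxs

start :: [[S]]
start = [[Elem0]]

symbols :: [S]
symbols = [Elem0, Elem1, Term, Plus, Mul, LeftP, RightP]

data Lang = Lang {unlang :: [[S]]} deriving (Show)

root, empty :: Lang
root  = Lang [[]]
empty = Lang []

prepend :: [[S]] -> Lang -> Lang
prepend l (Lang lang) = Lang [h <> t | h <- l, t <- lang]

derivative :: Lang -> S -> Lang
derivative (Lang lang) s = Lang (derive lang s)

union :: Lang -> Lang -> Lang
union (Lang a) (Lang b) = Lang (a <> b)

epsilon :: Lang -> Bool
epsilon (Lang a) = any (== []) a

recognize :: [Token] -> Bool
recognize ts = epsilon (foldl step (prepend start root) ts)
    where
    step lang t = foldl (\acc s -> union acc (rstep lang t s)) empty symbols
    rstep lang t s =
        prepend (left_linear_closure s `derive` t) (derivative lang s)
```

With this, recognize [Term] and the parenthesized sum above come out True, while prefixes of valid strings such as [Term, Plus] and [LeftP, Term] come out False.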
Splitting the language model
If we look at the simple language model, the most expensive part is the prepend step. We shall move toward the shape described in the paper.
```haskell
data Lang = Lang {unlang :: Set.Set ([S], Lang), epsilon :: Bool}
    deriving (Show, Ord, Eq)
```

The epsilon field describes whether the language contains the empty word; if it does, the string is recognized. The recognizer stays intact, but our language model changes.
```haskell
root :: Lang
root = Lang Set.empty True

prepend :: [[S]] -> Lang -> Lang
prepend seq lang = foldl union empty (fmap step seq)
    where
    step :: [S] -> Lang
    step [] = lang
    step xs = Lang (Set.singleton (xs, lang)) False

derivative :: Lang -> S -> Lang
derivative (Lang slang _) s = foldl union empty (Set.map step slang)
    where
    step :: ([S], Lang) -> Lang
    step ([x], lang)    | (x == s) = lang
    step ((x:xs), lang) | (x == s) = Lang (Set.singleton (xs, lang)) False
    step (xs, lang) = empty

empty :: Lang
empty = Lang Set.empty False

union :: Lang -> Lang -> Lang
union (Lang a e1) (Lang b e2) = Lang (a <> b) (e1 || e2)
```
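To convince ourselves the set-based model behaves like the old one, here is a standalone trace (definitions repeated, with a pruned symbol type): prepending the word [Term, Plus] onto the root and deriving by Term and then Plus lands back at a language containing the empty word.

```haskell
import qualified Data.Set as Set

data S = Term | Plus deriving (Show, Eq, Ord)  -- just enough symbols

data Lang = Lang {unlang :: Set.Set ([S], Lang), epsilon :: Bool}
    deriving (Show, Ord, Eq)

root, empty :: Lang
root  = Lang Set.empty True
empty = Lang Set.empty False

union :: Lang -> Lang -> Lang
union (Lang a e1) (Lang b e2) = Lang (a <> b) (e1 || e2)

prepend :: [[S]] -> Lang -> Lang
prepend sq lang = foldl union empty (fmap step sq)
    where
    step [] = lang
    step xs = Lang (Set.singleton (xs, lang)) False

derivative :: Lang -> S -> Lang
derivative (Lang slang _) s = foldl union empty (Set.map step slang)
    where
    step ([x], lang)    | x == s = lang
    step ((x:xs), lang) | x == s = Lang (Set.singleton (xs, lang)) False
    step _ = empty

-- Deriving by the word's symbols in order recovers epsilon;
-- deriving by the wrong symbol yields the empty language.
good, bad :: Bool
good = epsilon (derivative (derivative (prepend [[Term, Plus]] root) Term) Plus)
bad  = epsilon (derivative (prepend [[Term, Plus]] root) Plus)
```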
Inserting reference numbers
The model forms a DAG, where a lot of structure is shared, and once structures are created they do not change. We'd like to exploit this by numbering our language models. For that we're going to create labeled references.
```haskell
data LangRef = LangRef {langid :: Int, unref :: Lang}
    deriving (Show)

instance Eq LangRef where
    (==) a b = (langid a == langid b)

instance Ord LangRef where
    (<=) a b = (langid a <= langid b)

data Lang = Lang {unlang :: Set.Set ([S], LangRef), epsilon :: Bool}
    deriving (Show)

type LangContext a = State Int a

introduce :: Lang -> LangContext LangRef
introduce lang = do
    i <- get
    put (i + 1)
    pure (LangRef i lang)
```
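A quick demonstration of the numbering, with a dummy Lang standing in for the real one: every introduce hands out a fresh identifier from the threaded counter. The state monad is hand-rolled here so the snippet needs only base; the article uses Control.Monad.State.Lazy instead.

```haskell
-- A tiny lazy state monad, equivalent in spirit to the one
-- from Control.Monad.State.Lazy used by the article.
newtype State s a = State {runState :: s -> (a, s)}

instance Functor (State s) where
    fmap f (State g) = State $ \s -> let (a, s') = g s in (f a, s')

instance Applicative (State s) where
    pure a = State $ \s -> (a, s)
    State f <*> State g = State $ \s ->
        let (h, s') = f s; (a, s'') = g s' in (h a, s'')

instance Monad (State s) where
    State g >>= f = State $ \s -> let (a, s') = g s in runState (f a) s'

get :: State s s
get = State $ \s -> (s, s)

put :: s -> State s ()
put s = State $ \_ -> ((), s)

-- Dummy language type, just to exercise the numbering machinery.
data Lang = Lang deriving (Show)

data LangRef = LangRef {langid :: Int, unref :: Lang} deriving (Show)

type LangContext a = State Int a

introduce :: Lang -> LangContext LangRef
introduce lang = do
    i <- get
    put (i + 1)
    pure (LangRef i lang)

-- Three introductions from counter 0 yield the ids [0, 1, 2].
ids :: [Int]
ids = fmap langid (fst (runState (mapM introduce [Lang, Lang, Lang]) 0))
```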
The recognizer needs to be changed to introduce the state for numbering references.
```haskell
recognize :: [Token] -> Bool
recognize ts = epsilon $ fst (runState (recognizeI ts) 0)

recognizeI :: [Token] -> LangContext Lang
recognizeI ts = do
    k <- introduce root
    foldM step (prepend start k) ts
    where
    step :: Lang -> Token -> LangContext Lang
    step lang t = foldM (union_each (rstep lang t)) empty (expect lang)
    union_each :: (S -> LangContext Lang) -> Lang -> S -> LangContext Lang
    union_each f lang' s = do
        next <- f s
        pure (union lang' next)
    rstep :: Lang -> Token -> S -> LangContext Lang
    rstep lang t s = do
        i <- introduce (derivative lang s)
        pure (prepend (left_linear_closure s `derive` t) i)
```
We also no longer "shift" by every symbol; instead we limit it to the symbols that are expected.
```haskell
expect :: Lang -> [S]
expect (Lang slang _) = Set.toList (Set.fromList (fmap step (Set.toList slang)))
    where
    step :: ([S], LangRef) -> S
    step (xs, _) = head xs
```
The language model changes very little.
```haskell
root :: Lang
root = Lang Set.empty True

prepend :: [[S]] -> LangRef -> Lang
prepend seq lang = foldl union empty (fmap step seq)
    where
    step :: [S] -> Lang
    step [] = unref lang
    step xs = Lang (Set.singleton (xs, lang)) False

derivative :: Lang -> S -> Lang
derivative (Lang slang _) s = foldl union empty (fmap step (Set.toList slang))
    where
    step :: ([S], LangRef) -> Lang
    step ([x], lang)    | (x == s) = unref lang
    step ((x:xs), lang) | (x == s) = Lang (Set.singleton (xs, lang)) False
    step (xs, lang) = empty

empty :: Lang
empty = Lang Set.empty False

union :: Lang -> Lang -> Lang
union (Lang a e1) (Lang b e2) = Lang (a <> b) (e1 || e2)
```
Breaking down into states
Examining further, we notice that the left-linear closure can be broken down into a state machine.
```haskell
left_linear_closure_state :: S -> Int
left_linear_closure_state Elem0  = 1
left_linear_closure_state Elem1  = 2
left_linear_closure_state Elem2  = 3
left_linear_closure_state Term   = 4
left_linear_closure_state Plus   = 5
left_linear_closure_state Mul    = 6
left_linear_closure_state LeftP  = 7
left_linear_closure_state RightP = 8

transitions :: Int -> S -> Maybe Int
transitions 0 _      = Nothing
transitions 1 LeftP  = pure 9
    -- [LeftP, Elem0, RightP],
    -- [LeftP, Elem0, RightP, Mul, Elem1],
    -- [LeftP, Elem0, RightP, Mul, Elem1, Plus, Elem0],
    -- [LeftP, Elem0, RightP, Plus, Elem0],
transitions 1 Term   = pure 10
    -- [Term],
    -- [Term, Mul, Elem1],
    -- [Term, Mul, Elem1, Plus, Elem0],
    -- [Term, Plus, Elem0]]
transitions 2 LeftP  = pure 11
    -- [LeftP, Elem0, RightP],
    -- [LeftP, Elem0, RightP, Mul, Elem1],
transitions 2 Term   = pure 12
    -- [Term],
    -- [Term, Mul, Elem1]]
transitions 3 LeftP  = pure 13
    -- [LeftP, Elem0, RightP],
transitions 3 Term   = pure 0
    -- [Term]]
transitions 4 Term   = pure 0
transitions 5 Plus   = pure 0
transitions 6 Mul    = pure 0
transitions 7 LeftP  = pure 0
transitions 8 RightP = pure 0
transitions 9 Elem0  = pure 14
    -- [Elem0, RightP],
    -- [Elem0, RightP, Mul, Elem1],
    -- [Elem0, RightP, Mul, Elem1, Plus, Elem0],
    -- [Elem0, RightP, Plus, Elem0],
transitions 10 Mul   = pure 15
    -- [],
    -- [Mul, Elem1],
    -- [Mul, Elem1, Plus, Elem0],
transitions 10 Plus  = pure 16
    -- [Plus, Elem0]]
transitions 11 Elem0 = pure 17
    -- [Elem0, RightP],
    -- [Elem0, RightP, Mul, Elem1],
transitions 12 Mul   = pure 18
    -- [],
    -- [Mul, Elem1]]
transitions 13 Elem0 = pure 19
    -- [Elem0, RightP],
transitions 14 RightP = pure 20
    -- [RightP],
    -- [RightP, Mul, Elem1],
    -- [RightP, Mul, Elem1, Plus, Elem0],
    -- [RightP, Plus, Elem0],
transitions 15 Elem1 = pure 21
    -- [Elem1],
    -- [Elem1, Plus, Elem0],
transitions 16 Elem0 = pure 0
    -- [Elem0]]
transitions 17 RightP = pure 22
    -- [RightP],
    -- [RightP, Mul, Elem1],
transitions 18 Elem1 = pure 0
    -- [Elem1]]
transitions 19 RightP = pure 0
    -- [RightP],
transitions 20 Mul   = pure 15
    -- [],
    -- [Mul, Elem1],
    -- [Mul, Elem1, Plus, Elem0],
transitions 20 Plus  = pure 16
    -- [Plus, Elem0],
transitions 21 Plus  = pure 16
    -- [],
    -- [Plus, Elem0],
transitions 22 Mul   = pure 18
    -- [],
    -- [Mul, Elem1],
transitions _ _ = Nothing

expectations :: Int -> [S]
expectations 0  = []
expectations 1  = [LeftP, Term]
expectations 2  = [LeftP, Term]
expectations 3  = [LeftP, Term]
expectations 4  = [Term]
expectations 5  = [Plus]
expectations 6  = [Mul]
expectations 7  = [LeftP]
expectations 8  = [RightP]
expectations 9  = [Elem0]
expectations 10 = [Mul, Plus]
expectations 11 = [Elem0]
expectations 12 = [Mul]
expectations 13 = [Elem0]
expectations 14 = [RightP]
expectations 15 = [Elem1]
expectations 16 = [Elem0]
expectations 17 = [RightP]
expectations 18 = [Elem1]
expectations 19 = [RightP]
expectations 20 = [Mul, Plus]
expectations 21 = [Plus]
expectations 22 = [Mul]

terminations :: Int -> Bool
terminations 0  = True
terminations 10 = True
terminations 12 = True
terminations 20 = True
terminations 21 = True
terminations 22 = True
terminations _  = False

start :: Int
start = 16
```
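One mechanical way to arrive at such a table is to build a trie over the closure's words: each trie node becomes a state, word-ends mark terminating states, and the outgoing edges give the expectations. A minimal sketch of that idea (my own illustration with my own names, not the paper's construction):

```haskell
import qualified Data.Map as Map

-- A trie node: whether a word may end here, plus outgoing edges.
data Trie a = Trie Bool (Map.Map a (Trie a)) deriving (Show)

emptyTrie :: Trie a
emptyTrie = Trie False Map.empty

insert :: Ord a => [a] -> Trie a -> Trie a
insert []     (Trie _ m) = Trie True m
insert (x:xs) (Trie e m) =
    Trie e (Map.insert x (insert xs (Map.findWithDefault emptyTrie x m)) m)

fromWords :: Ord a => [[a]] -> Trie a
fromWords = foldr insert emptyTrie

-- The analogue of 'expectations': symbols with an outgoing transition.
expects :: Trie a -> [a]
expects (Trie _ m) = Map.keys m

-- The analogue of 'terminations': may a word stop in this state?
terminates :: Trie a -> Bool
terminates (Trie e _) = e
```

A plain trie duplicates shared suffixes, whereas the hand-written table reuses states (both 10 Mul and 20 Mul go to 15); getting that sharing requires merging equivalent states, as in DFA minimization.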
The derivation step becomes a really simple transition through the state machine.
```haskell
derive :: Int -> S -> Maybe Int
derive xs s = transitions xs s
```
The language model flips from sets of words to states.
```haskell
data Lang = Lang {unlang :: Set.Set (Int, LangRef), epsilon :: Bool}
    deriving (Show)
```

In the recognizer, we upgrade rstep: now it introduces the derivative only if deriving produces something.
```haskell
rstep lang t s = do
    case left_linear_closure_state s `derive` t of
        Nothing -> pure empty
        Just xs -> do
            i <- introduce (derivative lang s)
            pure (prepend xs i)
```
The language model is changed to operate on state transitions.
```haskell
expect :: Lang -> [S]
expect (Lang slang _) = Set.toList (Set.fromList (join (fmap step (Set.toList slang))))
    where
    step :: (Int, LangRef) -> [S]
    step (xs, lang) = expectations xs

root :: Lang
root = Lang Set.empty True

prepend :: Int -> LangRef -> Lang
prepend xs lang = if terminations xs
    then if length (expectations xs) == 0
        then unref lang
        else Lang (Set.singleton (xs, lang)) False `union` unref lang
    else Lang (Set.singleton (xs, lang)) False

derivative :: Lang -> S -> Lang
derivative (Lang slang _) s = foldl union empty (fmap step (Set.toList slang))
    where
    step :: (Int, LangRef) -> Lang
    step (xs, lang) = case transitions xs s of
        Just ys -> prepend ys lang
        Nothing -> empty

empty :: Lang
empty = Lang Set.empty False

union :: Lang -> Lang -> Lang
union (Lang a e1) (Lang b e2) = Lang (a <> b) (e1 || e2)
```
That's it for the recognizer, for now.
Mid-conclusion
The recognizer consists of roughly 70 lines of Haskell.
This algorithm is conceptually simple and efficient, except for one problem: what should we do with left-recursive grammars?
Left recursion on left linear closure
Let's take a different grammar, one that's left-recursive.
```
E -> P
E -> E Plus P
P -> R
P -> P Mul R
R -> Term
R -> LeftP E RightP
```
I illustrate how to resolve recursive rules when constructing the left linear closure. This isn't in the paper, I made it up by examining and thinking about the problem a bit.
First, number the rules and move the left-hand-side symbol into a suffix.
```
P {0 E}
E Plus P {1 E}
R {2 P}
P Mul R {3 P}
Term {4 R}
LeftP E RightP {5 R}
```
Then select one rule and derive it once. We mark which rule was pasted in with a suffix.
```
P {0 E}
───────────────────
R {2 P} {0 E}
P Mul R {3 P} {0 E}
```
Notice that rules 1 and 0:3 here are recursive; we know it because we just marked them.
```
P Mul R {3 P} {0 E}
E Plus P {1 E}
```
We put these rules aside and continue on the remaining rule.
```
R {2 P} {0 E}
────────────────────────────────
Term {4 R} {2 P} {0 E}
LeftP E RightP {5 R} {2 P} {0 E}
```
Next we form regular expressions by pasting the set-aside rules back in. How do we do that? Observe that the rule E Plus P {1 E} is recursive at the top level, so we drop it in there.
```
Term {4 R} {2 P} {0 E} (Plus P {1 E})*
LeftP E RightP {5 R} {2 P} {0 E} (Plus P {1 E})*
```
The second recursive rule, P Mul R {3 P} {0 E}, is recursive at the second level. We clip it and paste it before the {0 E} subsequence. If you had a longer subsequence, you'd clip it up appropriately and paste it before the occurrence of the whole subsequence.
```
Term {4 R} {2 P} (Mul R {3 P})* {0 E} (Plus P {1 E})*
LeftP E RightP {5 R} {2 P} (Mul R {3 P})* {0 E} (Plus P {1 E})*
```
Repeating these steps, we get:
```
Term {4 R} {2 P} (Mul R {3 P})* {0 E} (Plus P {1 E})*
LeftP E RightP {5 R} {2 P} (Mul R {3 P})* {0 E} (Plus P {1 E})*
Term {4 R} {2 P} (Mul R {3 P})*
LeftP E RightP {5 R} {2 P} (Mul R {3 P})*
Term {4 R}
LeftP E RightP {5 R}
```
You can do this transform in any order and it ought to produce the same result.
Language edit distance
Another interesting case is language edit distance, i.e. an error-correcting parser. We can use an old trick to formulate it.
Aho, A. V., & Peterson, T. G. (1972). A Minimum Distance Error-Correcting Parser for Context-Free Languages. SIAM Journal on Computing, 1(4), 305–312. doi:10.1137/0201022

We create a grammar with corrections that covers the existing grammar:
- Replace every terminal e by a nonterminal E(e).
- Replace the start symbol by start'.
- Add the following productions:

```
start' → start
start' → start H
H → H I
H → I
```
For every terminal a in the grammar, add the following:

```
E(a) → a
E(a) → b    (for every other terminal b)
E(a) → H a
E(a) →
I → a
```
The covering grammar matches every input: either by ignoring a symbol (the I rules), replacing it (the E(a) → b rules), or inserting it (the empty rules).
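These per-terminal corrections are uniform, so they can be generated mechanically. A sketch of that generation, with symbols as plain strings (my own encoding; coverTerminal is a hypothetical helper, not from the paper):

```haskell
-- A production: left-hand side and its right-hand side.
type Prod = (String, [String])

-- Correction rules for a terminal a, given the full terminal list:
-- keep it, replace it, absorb junk before it, or supply it when missing.
coverTerminal :: [String] -> String -> [Prod]
coverTerminal terminals a =
    [ (e, [a]) ]                              -- E(a) -> a
    ++ [ (e, [b]) | b <- terminals, b /= a ]  -- E(a) -> b, replacement
    ++ [ (e, ["H", a]) ]                      -- E(a) -> H a, junk before a
    ++ [ (e, []) ]                            -- E(a) -> , a missing from input
    ++ [ ("I", [a]) ]                         -- I -> a, ignored symbol
    where e = "E(" ++ a ++ ")"
```

Instantiating this for every terminal, together with the start' and H productions above, yields the covering grammar.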
We can avoid some complexity here by modifying the recognizer.
First, sample the next set from the previous set; this replaces all the H rules. Next, shift with every terminal symbol; this replaces the replacement rules.
Finally what we are left with is the empty rule for every terminal. It can be replaced by an adjustment in the preprocessing phase.
When we adjust for empty rules, the grammar explodes in size. I wrote a test program to construct a left-linear closure with emptiable rules: we get 186 rules for our simple grammar, and I think I didn't do it correctly.
Conclusion
If you have a suitable grammar, this is an incredibly simple algorithm. Emptiable rules and left-recursive grammars add complexity, though.
Overall I like what I've seen so far, but I find the paper challenging to understand. There's enough material here for an article series. I think the next step would be to look closer at emptiable rules, and then at the parsing part.