Earley parsing in Idris: WellFounded (Part 2)

In this series we write a certified Earley recognizer and I tell how it went.

In the second part we discuss well-founded relations as they appear in Idris, then represent string generation through shared packed parse forests.

More Idris-suckage

Although Idris is mostly working, it tends to be really picky over which seemingly similar ways you use to represent your program. Sometimes the pattern matching declarations confuse it while other times the case expressions don't work.

Similarly sometimes the code starts working when you drop it out from the where-clause. This isn't a well-made programming environment you'd want to use.

Too bad everything else sucks even worse after you've figured out what you can do with Idris. Going back from Idris to Python now feels like going from Python to Visual Basic. Going to statically typed languages is like going straight down to coding Cobol or Algol.

WellFounded

Well-founded relations are easier to see than understand. Idris has this wfRec -function in its library.

wfRec : WellFounded rel =>
  (step : (x : a) -> ((y : a) -> rel y x -> b) -> b)
  -> a -> b

Here's a common way to structure programs around wfRec:

SomeRelation : A -> A -> Type
WellFounded SomeRelation

doSomething : A -> b
doSomething = wfRec step
  where step : (x : A) -> ((y : A) -> SomeRelation y x -> b) -> b
        step x rec = ?some_b

We can define recursive functions as long as we prove our recursion is covered by a well-founded relation.

Relation seem to be well-founded if it's accessible over all values it relates:

interface WellFounded (rel : a -> a -> Type) where
  wellFounded : (x : _) -> Accessible rel x

Accessibility is difficult to understand, and it's described by the following construction:

data Accessible : (rel : a -> a -> Type) -> (x : a) -> Type where
  Access : (rec : (y : a) -> rel y x -> Accessible rel y)
    -> Accessible rel x

Lets form some sort of a relation:

data Relation1 : Nat -> Nat -> Type where
  SAB : (S a = b) -> Relation1 a b

sab_acc : (x : Nat) -> (y : Nat) -> Relation1 y x -> Accessible Relation1 y
sab_acc Z Z (SAB Refl) impossible
sab_acc (S k) _ (SAB Refl) = Access (sab_acc k)

WellFounded Relation1 where
  wellFounded x = Access (sab_acc x)

Unfortunately we aren't smarter after looking at Accessible -construction. Fortunately once we have such relations, it seems we can form recursive functions on them if we happen to have them.

Idris library contains several variations in the WellFounded -library. One is the induction by size. We can use this to do implement a transitive closure over a list.

transitive_closure : (s -> w -> Bool) -> (w -> s)
  -> List w -> List s -> List w
transitive_closure {s} {w} ck conv g vs = let
  ind = sizeInd {a=List w} {P=TCR s w}
  in ind step g vs
  where step : (x : List w)
    -> ((y : List w) -> Smaller y x -> TCR s w y)
    -> TCR s w x
        step g rec [] = []
        step g rec (a::aa) = case inspect (bin (ck a) g) of
          (With (ll,rr) prf) => case inspect (length rr) of
            (With Z prf2) => step g rec aa
            (With (S z) prf2) => let
              ree = rec ll (bin_smaller (ck a) prf (flip prf2) {k=g})
              in rr ++ ree (map conv rr)

Unfortunately the type signature is not precise enough to provide the necessary proofs for us to verify the parsing algorithm with this, but the recipe is there. To calculate the transitive closure we repeat some single step first on the whole set then proceed to repeat it on subsets of that set.

SPPF/PT structures

The "Generates" -relation from the previous post, that described correct behavior of the software, quite resembled like a parse tree. This is a recurring thing in proving things correct as the proofs tend to convey interesting behavior.

Since parse trees can convey the idea that the grammar generates some string, I was encouraged to pick up structures that is commonly used with Earley parsers. Shared packed parse forests.

data SPPF : (g:Grammar s) -> SPPS s g -> List s -> Type where
  Leaf : SPPF g (Sym s) [s]
  Short : (rule:Index g) -> SPPF g (Sym x) xs
    -> {auto short_rule:(rhs (nth rule) = [x])}
    -> {auto z_is:(lhs (nth rule) = z)}
    -> SPPF g (Sym z) xs
  Branch : SPPF g (prev_dot rule dot) xs -> SPPF g (Sym (nth dot)) ys
    -> {default concat_theorem ok:(k:Index zs ** cut k = (xs,ys))}
    -> {auto z_is:(this_dot rule dot = z)}
    -> SPPF g z zs
  DeepShort : (rule:Index g) -> SPPF g (Sym x) xs
    -> {auto short_rule:(rhs (nth rule) = [x])}
    -> {auto z_is:(lhs (nth rule) = z)}
    -> SPPF g (Sym z) xs
    -> SPPF g (Sym z) xs
  DeepBranch : SPPF g (prev_dot rule dot) xs -> SPPF g (Sym (nth dot)) ys
    -> {default concat_theorem ok:(k:Index zs ** cut k = (xs,ys))}
    -> {auto z_is:(this_dot rule dot = z)}
    -> SPPF g z zs
    -> SPPF g z zs

I wanted to convey it like it appears in papers and books, so we have chains of branches and short chains. The parsed symbol is replaced by a parsing state, which may be either a completed symbol or a partial item.

data SPPS : (s:Type) -> Grammar s -> Type where
  Sym : s -> SPPS s g
  Dot : {g:Grammar s} -> (rule:Index g) -> Index (rhs_tail (nth rule)) -> SPPS s g

We can easily prove that we won't have non-empty sppf-structures:

sppf_nonempty : SPPF g s zs -> NonEmpty zs
sppf_nonempty Leaf = IsNonEmpty
sppf_nonempty (Short _ x) = sppf_nonempty x
sppf_nonempty (Branch _ _ {ok=(k**_)}) = index_nonempty k
sppf_nonempty (DeepShort _ _ z) = sppf_nonempty z
sppf_nonempty (DeepBranch _ _ _ {ok=(k**_)}) = index_nonempty k

Now the SPPF actually stands for multiple parse trees. We can describe the corresponding parse tree by dropping the chaining structures:

data PT : (g:Grammar s) -> SPPS s g -> List s -> Type where
  SLeaf : PT g (Sym s) [s]
  SShort : (rule:Index g) -> PT g (Sym x) xs
    -> {auto short_rule:(rhs (nth rule) = [x])}
    -> {auto z_is:(lhs (nth rule) = z)}
    -> PT g (Sym z) xs
  SBranch : PT g (prev_dot rule dot) xs -> PT g (Sym (nth dot)) ys
    -> {default concat_theorem ok:(k:Index zs ** cut k = (xs,ys))}
    -> {auto z_is:(this_dot rule dot = z)}
    -> PT g z zs

The structure cannot be built if you don't actually have a correct result, so having this parse forest over some string implies you have at least one recognized interpretation of the input. It can be confirmed, we can convert any sppf to some parse tree it represents.

sppf_to_any_pt : SPPF g s zs -> PT g s zs
sppf_to_any_pt Leaf = SLeaf
sppf_to_any_pt (Short rule y {short_rule=short_rule} {z_is=z_is}) =
  SShort rule (sppf_to_any_pt y) {short_rule=short_rule} {z_is=z_is}
sppf_to_any_pt (Branch {rule} {dot} {ok} {z_is} x y) =
  SBranch {rule=rule} {dot=dot} (sppf_to_any_pt x) (sppf_to_any_pt y) {ok=ok} {z_is=z_is}
sppf_to_any_pt (DeepShort rule y w {short_rule=short_rule} {z_is=z_is}) = sppf_to_any_pt w
sppf_to_any_pt (DeepBranch {rule} {dot} {ok} {z_is} x y z) = sppf_to_any_pt z

Then we can conclude that if it's impossible to get a parse tree, then it's also impossible to get a SPPF. We can prove this through contraposition.

pt_theorem : Not (PT g z zs) -> (SPPF g z zs) -> Void
pt_theorem f = (f . sppf_to_any_pt)

The recognizers we have had so far are unable to produce these SPPF structures because they provide too few guarantees about their behavior.

What kind of results can we get?

It seems it's sufficient to prove ¬(PT g z zs) to conclude that some string zs cannot be parsed. But we can't be as loose if we get a positive result.

Ideally we would like to return a SPPF, and that should contain every possible parse tree that can be built with the given constraints.

We have to prove that given any parse-tree a, we can find it from the SPPF that we build:

(a:PT g z zs) -> Contains a p

Though we have a problem we have to treat before we proceed. In some cases we can't build such a "full" SPPF because there are infinite amount of parses!

Consider the following type's inhabitants:

PT [B ::= [A], B ::= [B]] (Sym B) [A]

That's right, we can build...

(rule 1) $ (rule 0) (Leaf)
(rule 1) $ (rule 1) $ (rule 0) (Leaf)
(rule 1) $ (rule 1) $ (rule 1) $ (rule 0) (Leaf)
(rule 1) $ (rule 1) $ (rule 1) $ (rule 1) $ (rule 0) (Leaf)
...

Yet we can't build an infinite SPPF that would cover it. It's also not something that we'd even want. It's nice that our SPPFs are finite. To account for this, we have to add that the structures we recognize won't be redundant. They have to satisfy the following non-redundancy condition.

data NonRedundant : List (Index g) -> PT g z zs -> Type where
  LeafND : NonRedundant [] SLeaf
  ShortND : NonRedundant a x
    -> Not (Elem rule a)
    -> NonRedundant (rule::a)
        (SShort rule x {short_rule=short_rule} {z_is=z_is})
  BranchND : NonRedundant _ x -> NonRedundant _ y
    -> NonRedundant [] (SBranch x y {rule=rule} {dot=dot} {ok=ok} {z_is=z_is})

And we can finally tell what condition we want to hold for the SPPF we want to examine:

sppf_is_complete : (p:SPPF g z zs) -> Type
sppf_is_complete {g} {z} {zs} p =
  ({rs:List (Index g)} -> (a:PT g z zs) -> NonRedundant rs a -> Contains a p)

The conclusion is that our recognizer should fill up the following structure:

data ParseResult : (g:Grammar s) -> SPPS s g -> List s -> Type where
  Parse : (p:SPPF g z zs) -> sppf_is_complete p -> ParseResult g z zs
  NoParse : {g:Grammar s}
    -> Not (PT g z zs)
    -> ParseResult g z zs

Technically this is still insufficient, because our parser knows more than this. I'm just not sure how to describe an idea of a "viable prefix".

The "Containment" is a fairly simple relational statement, though I'd hope to make these simpler:

data Contains : PT g z zs -> SPPF g z zs -> Type where
  LeafIn : Contains SLeaf Leaf
  ShortInTop : Contains x xs -> Contains
    (SShort rule x {short_rule=short_rule} {z_is=z_is})
    (Short  rule xs {short_rule=short_rule} {z_is})
  BranchInTop : Contains x xs -> Contains y ys -> Contains
    (SBranch x y {rule=rule} {dot=dot} {ok=ok} {z_is=z_is})
    (Branch  xs ys {rule=rule} {dot=dot} {ok=ok} {z_is=z_is})
  ShortIn : Contains x xs -> Contains
    (SShort rule x {short_rule=short_rule} {z_is=z_is})
    (DeepShort rule xs _ {short_rule=short_rule} {z_is})
  BranchIn : Contains x xs -> Contains y ys -> Contains
    (SBranch x y {rule=rule} {dot=dot} {ok=ok} {z_is=z_is})
    (DeepBranch  xs ys _ {rule=rule} {dot=dot} {ok=ok} {z_is=z_is})
  ShortInDeep : Contains x ys -> Contains x
    (DeepShort rule xs ys {short_rule=short_rule} {z_is})
  BranchInDeep : Contains x zs -> Contains x
    (DeepBranch  xs ys zs {rule=rule} {dot=dot} {ok=ok} {z_is=z_is})

I suppose to build the proof, we have to construct items like before, except that they carry the SPPF.

data Item : (g:Grammar s) -> List s -> Type where
  Pr : (rule:Index g) -> Item g xs
  Sh : (rule:Index g)
    -> (dot:Index (rhs_tail (nth rule)))
    -> (p:SPPF g (prev_dot rule dot) zs ** sppf_is_complete p)
    -> (k:Index xs ** (cut k = (_,zs)))
    -> Item g xs

To construct anything at all, we need the base case that is trivial. I suppose it'd be the symbol shifting with empty grammar, it always succeeds:

lemma_shift_symbol : (a : PT [] (Sym x) [x]) -> NonRedundant pr a -> Contains a Leaf
lemma_shift_symbol SLeaf x = LeafIn
lemma_shift_symbol (SShort rule z) x = absurd rule
lemma_shift_symbol (SBranch {rule} y z) x = absurd rule

shift_symbol : (sym:s) -> (p:SPPF [] (Sym sym) [sym] ** sppf_is_complete p)
shift_symbol sym = (Leaf ** lemma_shift_symbol)

Now I have to tell you that I haven't gotten further than this. I'd have to correctly define what is an item set, and the parser itself.

ParseState : (g:Grammar s) -> SPPS s g -> List s -> Type

init : (g:Grammar s) -> (acc:s) -> ParseState g (Sym acc) []
shift : (g:Grammar s) -> ParseState g z zs -> (sym:s) -> ParseState g z (zs ++ [sym])
check : ParseState g z zs -> ParseResult g z zs

If we had those, then we could build this:

append_lemma : (ys ++ []) = ys
append_lemma {ys = []} = Refl
append_lemma {ys = (x :: xs)} = cong (append_lemma {ys=xs})

append_lemma_2 : (ys:List a) -> (x:a) -> (xs:List a) -> ((ys ++ [x]) ++ xs) = ys ++ (x :: xs)
append_lemma_2 [] x xs = Refl
append_lemma_2 (y :: ys) x xs = cong (append_lemma_2 ys x xs)

parse : DecEq s => (g : Grammar s) -> (acc:s) -> (input:List s) -> 
  ParseResult g (Sym acc) input
parse g acc xs = check $ step g xs $ init g acc
  where step : (g:Grammar s) -> (xs:List s) -> ParseState g z ys -> ParseState g z (ys++xs)
        step g [] pst {ys} = rewrite append_lemma {ys=ys} in pst
        step g (x :: xs) pst {z} {ys} = let
          pst2 = shift g pst x {z=z} {zs=ys}
          pstn = step g xs pst2 {z=z} {ys=ys ++ [x]}
          in rewrite flip $ append_lemma_2 ys x xs in pstn

The difficult parts are still up ahead and I'm not sure if I am able to finish it.

Similar posts