Invertible lexical analysis & Regular expressions

I've left lexical analysis aside while writing tools to parse languages, though ideally we'd have a good set of tools for reading and producing lexemes.

Lexical analysis partitions sequences of characters into lexemes and then identifies them for the parser.

Regular expressions?

You usually use a regular language to break the input sequence into pieces.

Regular languages are defined through the following rules:

  1. The empty language and the empty string are regular languages.
  2. A single character of the alphabet forms a regular language.
  3. The union, concatenation and repetition (Kleene star) of regular languages are regular languages.

We also have context-free grammars for these kinds of languages. They are called regular grammars. Such grammars consist of only certain kinds of production rules:

  1. A →
  2. A → aB

A and B stand for nonterminal symbols, and a stands for a character of some alphabet.

As long as the grammar is consistent about which side the nonterminal appears on, the second production rule may also be mirrored (A → Ba). I picked right regular grammars because they come with a trivial scheme for detecting whether a string is in the language.

You can trivially produce a finite automaton out of this kind of grammar.

  1. Rename symbols into states.
  2. Mark every state whose symbol has an empty production (A →) as accepting.
  3. For every A → aB, add a transition from A to B on character 'a'.

You can obtain a parse tree by recording which transitions were made. Therefore you can think of such an automaton as a parser.
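
Here's a minimal sketch of those three steps in Python. The production-rule representation (tuples of symbols and characters) is my own choice, not anything standard:

# A right regular grammar given as a list of productions:
#   ('A',)           stands for  A →
#   ('A', 'a', 'B')  stands for  A → aB
def grammar_to_automaton(productions):
    accepting = set()    # step 2: symbols that have an empty production
    transitions = []     # step 3: (state, character, next state, rule index)
    for i, rule in enumerate(productions):
        if len(rule) == 1:
            accepting.add(rule[0])
        else:
            lhs, char, rhs = rule
            transitions.append((lhs, char, rhs, i))
    return accepting, transitions   # step 1: the symbols themselves act as states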

Here's a small example: a grammar that recognizes strings where 'a' may repeat any number of times and 'b', when present, always comes in pairs.

   A → 
0. A → aA
1. A → bB
2. B → bA

I added small labels for the rules to distinguish them from characters and symbols.

Ok, what kind of strings can be generated with this grammar? The empty string is the obvious one.

You may already notice that we also accept 'a'.

a
A0 A

The input may also start with 'b', but it won't be accepted yet.

b
A1 B

Not until it's completed with another b.

b  b
A1 B2 A

And you may verify that only strings such as abbaaaabba are accepted.

You could also express the above language with a regular expression. The expression for the language would be (a|bb)*.
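
If you want to double-check that equivalence, Python's standard re module agrees. This is just a throwaway check, not part of the machinery built below:

import re

assert re.fullmatch(r'(a|bb)*', 'abbaaaabba')
assert re.fullmatch(r'(a|bb)*', '')
assert not re.fullmatch(r'(a|bb)*', 'ab')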

The main motivation for parsing is to recognize strings and/or extract information from them. In this sense it's interesting to ask: what do we need to record so that we can reconstruct the string?

Examining the grammar, it's quite obvious that the transition from B is always there. Likewise, there are only two ways the string may build up from A, so it's sufficient to see whether the left or the right branch was taken. We only need to record whether rule 0 or rule 1 was taken at each step. That's all there is to record. Therefore abbaaaabba can be represented with [0,1,0,0,0,0,1,0].
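
Here's a small sketch of that bookkeeping and its inverse for the example grammar. The table layout is my own, but the rule numbering follows the labels above:

# Rules of the example grammar, keyed by (symbol, character):
# 0 = A → aA, 1 = A → bB, 2 = B → bA.
RULES = {('A', 'a'): (0, 'A'), ('A', 'b'): (1, 'B'), ('B', 'b'): (2, 'A')}

def encode(string):
    state, out = 'A', []
    for ch in string:
        rule, state = RULES[(state, ch)]
        if rule != 2:             # the B → bA step is forced, no need to record it
            out.append(rule)
    assert state == 'A'           # only A may end the string (A → ε)
    return out

def decode(codes):
    # Rule 0 contributes an 'a', rule 1 (followed by the forced rule 2) a 'bb'.
    return ''.join('a' if rule == 0 else 'bb' for rule in codes)

assert encode("abbaaaabba") == [0, 1, 0, 0, 0, 0, 1, 0]
assert decode([0, 1, 0, 0, 0, 0, 1, 0]) == "abbaaaabba"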

Why not drive a parser directly from the input stream?

From the above you may have figured out that a lexical analyser is a sort of parsing step in itself. We have context-free grammars of a restricted kind, so you may be wondering why it isn't simply merged into LR parsing.

I think the main reason to keep it separate is to discard whitespace and detect keywords.

Another reason is to add details that make your language harder to parse with general-purpose tools. A whole lot of detail and variation seems to fit into the gap that forms between characters and lexemes.

Thompson's construction

Regular grammars aren't usually how regular languages are represented, though. Often you just write a regular expression for each lexeme.

Thompson's construction is a method of transforming a regular expression into a nondeterministic finite automaton. The Wikipedia page explains the construction quite well. Here's the code:

from collections import namedtuple
from itertools import count

Empty = namedtuple('Empty', [])
Symbol = namedtuple('Sym', ['sym'])
Union = namedtuple('Union', ['left', 'right'])
Concat = namedtuple('Concat', ['left', 'right'])
Repeat = namedtuple('Repeat', ['seq'])

def nfa(rule):
    transitions = set()
    new_state = count().__next__
    start = new_state()
    stop = new_state()
    build_nfa(start, stop, rule, new_state, transitions.add)
    return start, stop, transitions, new_state()

def build_nfa(start, stop, rule, new_state, new_transition):
    # Thompson's construction: wire up transitions so that the sub-automaton
    # between 'start' and 'stop' recognizes 'rule'. None marks an epsilon
    # transition.
    if isinstance(rule, Empty):
        new_transition((start, stop, None))
    elif isinstance(rule, Symbol):
        new_transition((start, stop, rule.sym))
    elif isinstance(rule, Union):
        start1 = new_state()
        start2 = new_state()
        stop1 = new_state()
        stop2 = new_state()
        new_transition((start, start1, None))
        new_transition((start, start2, None))
        new_transition((stop1, stop, None))
        new_transition((stop2, stop, None))
        build_nfa(start1, stop1, rule.left, new_state, new_transition)
        build_nfa(start2, stop2, rule.right, new_state, new_transition)
    elif isinstance(rule, Concat):
        mid = new_state()
        build_nfa(start, mid, rule.left, new_state, new_transition)
        build_nfa(mid, stop, rule.right, new_state, new_transition)
    elif isinstance(rule, Repeat):
        start0 = new_state()
        stop0 = new_state()
        new_transition((start, stop, None))
        new_transition((start, start0, None))
        new_transition((stop0, stop, None))
        new_transition((stop0, start0, None))
        build_nfa(start0, stop0, rule.seq, new_state, new_transition)
    else:
        assert False

To match a string with the NFA you need to track state sets, because a nondeterministic automaton may contain empty-transition loops. The number of states is proportional to the length of the regular expression, which means we can use a bitmap or a vector with one marker per state to track whether that state has already been added into the current state set.

def recognize(rule, string):
    start, stop, transitions, size = nfa(rule)
    vector = [0] * size     # last position at which each NFA state was added
    state = [start]
    step(state, state, transitions, vector, None, 1)    # epsilon closure of start
    for i, sym in enumerate(string, 2):
        next_state = []
        step(state, next_state, transitions, vector, sym, i)         # consume sym
        step(next_state, next_state, transitions, vector, None, i)   # epsilon closure
        state = next_state
    return (stop in state)

def step(state, next_state, transitions, vector, sym, position):
    for st in state:
        for a, b, label in transitions:
            if st == a and label == sym and vector[b] != position:
                vector[b] = position
                next_state.append(b)

if __name__=="__main__":
    empty = Empty()
    sym_a = Symbol('a')
    test_rule = Union(empty, sym_a)
    assert recognize(test_rule, "")
    assert recognize(test_rule, "a")
    assert not recognize(test_rule, "aa")

Use the powerset construction on the NFA tables and you get DFA tables. It's not a complex operation: you basically simulate the NFA and record which state sets can occur. (There's code for this further below.)

Parsing with regular expressions

If you record where each state was created in build_nfa, you can use the result to construct a parse tree from the states visited by the recognize procedure.

If you temporarily make recognize print the states and then run the following:

zeros = Concat(
    Repeat(Concat(Symbol('0'), Symbol('0'))),
    Repeat(Concat(Symbol('0'), Concat(Symbol('0'), Symbol('0')))))
assert recognize(zeros, "000000")

You get this:

[0, 3, 2, 1, 6]
[5, 8]
[4, 9, 3, 2, 1, 6]
[7, 5, 8, 6, 1]
[4, 9, 8, 3, 2, 1, 6]
[7, 9, 5, 8, 6, 1]
[7, 4, 9, 8, 6, 1, 3, 2]

The above program runs the regex (00)*(000)*. If you record where the states were created, you find out they correspond to the expression like this:

0          1  root
     2        concat
 3  4 6   7   repeat, repeat
  5    89     concat
 (00)*(000)*

Then look at how the states align with the input string.

 000000
 ------
0 
1 11111
2 2 2 2
3 3 3 3
  4 4 4
 5 5 5
6 66666
   7 77
 8 8888
  99999

If you look at how 0,1,2 align in the graph, you see it recognizes several possible boundaries in the parse:

|000000
00|0000
0000|00
000000|

The second and third boundaries aren't valid because the state chart doesn't show valid subsequences for them. We have exactly two subsequences:

 000000 
6  6
   7  7
 8  8
  9  9

 000000
3 3 3 3
  4 4 4
 5 5 5

The choices can be encoded so that the original string can be reconstructed from them. For empty/symbol/concat you don't need to write any information; for union/repeat you just write down which branch was chosen.

So for example here we'd have two results; here are their encodings along with the parse trees you could get for them:

[0,1,1,0]      [[],['000','000']]
[1,1,1,0]      [['00','00','00'],[]]

We have two parse trees for the same input string. The regex I gave is an example of an ambiguous regular expression.

Regex ambiguity

I hadn't heard of regex ambiguity before preparing to write this post. I guess it's rarely seen as a problem because tools never report it and people never notice when they write an ambiguous regex.
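
For instance, Python's re happily matches the ambiguous regex from above and silently settles on one of the two parses; the backtracking engine just matches greedily:

import re

m = re.fullmatch(r'((00)*)((000)*)', '000000')
assert m.group(1) == '000000'    # (00)* swallowed everything
assert m.group(3) == ''          # (000)* matched zero repetitions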

It's possible to determine whether a regular expression is ambiguous. Let's do the powerset construction for the previous regular expression.

init: 0
fin: 1
(0, 2)
(0, 3)
(2, 1)
(2, 6)
(3, 5, '0')
(4, 2)
(4, 3)
(5, 4, '0')
(6, 8, '0')
(7, 1)
(7, 6)
(8, 9, '0')
(9, 7, '0')

Ok, there's our NFA. Let's fill in the initial state.

0: 0
   1
   2
   3 '0' ---> 5
   6 '0' ---> 8

On '0' it goes into the 5,8 state. The 5,8 state has no epsilon edges going elsewhere, so the next state is...

1: 5 '0' ---> 4
   8 '0' ---> 9

What does the 4,9 state look like?

2: 1
   2
   3 '0' ---> 5
   4
   6 '0' ---> 8
   9 '0' ---> 7

We can continue and go to 5,7,8...

3: 1
   5 '0' ---> 4
   6 '0' ---> 8
   7
   8 '0' ---> 9

The 4,8,9 still seems like a new state.

4: 1
   2
   3 '0' ---> 5
   4
   6 '0' ---> 8
   8 '0' ---> 9
   9 '0' ---> 7

And 5,7,8,9 requires its own state as well.

5: 1
   5 '0' ---> 4
   6 '0' ---> 8
   7
   8 '0' ---> 9
   9 '0' ---> 7

The 4,7,8,9 set still needs its own state. It's a variation of state 4 with 7 added in.

6: 1
   2
   3 '0' ---> 5
   4
   6 '0' ---> 8
   7
   8 '0' ---> 9
   9 '0' ---> 7

But on '0' it traces back to state 5, so that's the last state.

So what do we have to observe in order to detect an ambiguous state? There seem to be two ways in which things can go wrong.

  1. A union merges two overlapping languages.
  2. A concatenation where the boundary between the two sides isn't unique.

Searching for hints of those conditions, we can think of a few things that might cause our regex to produce an ambiguous result. Let's look at the state map again.

0          1  root
     2        concat
 3  4 6   7   repeat, repeat
  5    89     concat
 (00)*(000)*

Looking at this, we have the reduction paths 2→1 and 7→1. These are two different ways the expression may reduce down to the accepting state. If 2 and 7 appear in the same DFA state, we have ambiguity. That happens in DFA state 6.

To see why, consider how you would build a parse tree from the DFA's output. You could do it in reverse, and the first thing to find out would be whether the last repetition was used or not. If '2' is in the state, it marks a reduction that skipped the second repetition entirely. '7' marks the opposite: it tells you the parse tree contains at least one round of that repetition.

Code for NFA -> DFA buildup

Here's something to build up the DFA if you happen to need it.

def dfa(rule):
    start, stop, transitions, size = nfa(rule)
    vector = [0] * size
    mapping = {}
    dfa_states = []
    dfa_transitions = []
    state = nfa_state_to_dfa_state([start], transitions, vector, 1)
    mapping[state] = 0
    dfa_states.append(state)
    k = 2
    for i, state in enumerate(dfa_states):   # dfa_states grows while we iterate
        for label, nst in build_dfa_transitions(state, transitions):
            nst = nfa_state_to_dfa_state(nst, transitions, vector, k)
            k += 1
            try:
                j = mapping[nst]
            except KeyError:
                j = len(dfa_states)
                mapping[nst] = j
                dfa_states.append(nst)
            dfa_transitions.append((i, j, label))
    return dfa_states, dfa_transitions

def nfa_state_to_dfa_state(state, transitions, vector, position):
    # Epsilon-close the NFA state set; the sorted tuple becomes the DFA state.
    step(state, state, transitions, vector, None, position)
    return tuple(sorted(state))

def build_dfa_transitions(state, transitions):
    block = {}
    for a, b, label in transitions:
        if (label is not None) and (a in state):
            try:
                st = block[label]
            except KeyError:
                st = block[label] = set()
            st.add(b)
    for label, nst in block.items():
        yield label, list(nst)
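
To reproduce the listing below, a call like this does the trick, assuming the zeros expression from earlier:

from pprint import pprint

pprint(dfa(zeros))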

For 'zeros' it prints out the following state sets and transitions:

[(0, 1, 2, 3, 6),
 (5, 8),
 (1, 2, 3, 4, 6, 9),
 (1, 5, 6, 7, 8),
 (1, 2, 3, 4, 6, 8, 9),
 (1, 5, 6, 7, 8, 9),
 (1, 2, 3, 4, 6, 7, 8, 9)],
[(0, 1, '0'),
 (1, 2, '0'),
 (2, 3, '0'),
 (3, 4, '0'),
 (4, 5, '0'),
 (5, 6, '0'),
 (6, 5, '0')])

Seven states like before, with the exact same transitions that we made by hand earlier on.
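
As a quick machine check of the observation from the ambiguity section, we can look for the two reduction states 2 and 7 appearing together in an accepting DFA state. This check is specific to this particular NFA, not a general ambiguity test:

states, transitions = dfa(zeros)
# 1 is the NFA's accepting state; 2 and 7 are the two competing reduction paths.
assert any(1 in s and 2 in s and 7 in s for s in states)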

Remapping of regexes

Now consider a different kind of regex:

Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec

Then take a regular expression with the same structure, built from the Finnish month names:

Tammi|Helmi|Maalis|Huhti|Touko|Kesä|Heinä|Elo|Syys|Loka|Marras|Joulu

Okay so "Helmi" would be encoded as [1,0]. This means that you can transform "Helmi" to "Feb", and vice versa.

If it wasn't obvious already, this mapping is only valid if you have an unambiguous regex and the two regular expressions have the same structure.

You can extend this to any structure if you have an invertible mapping from the encoding to that structure. For example, we can use this in lexical analysis:

Helmi <--> [1,0] <--> (MONTH, 2)

There's a bit of an obstacle to this, though: you need an invertible function that maps between regular expressions and lexemes.
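
Here's one way to build the two month expressions so that they share their union structure. The helper names are made up, but the results can be fed to the recognize defined in the next section:

def literal(word):
    # A word as a chain of Concat(Symbol, ...) nodes.
    rule = Symbol(word[-1])
    for ch in reversed(word[:-1]):
        rule = Concat(Symbol(ch), rule)
    return rule

def alternatives(words):
    # Nested unions; the union structure only depends on the number of words.
    rule = literal(words[-1])
    for word in reversed(words[:-1]):
        rule = Union(literal(word), rule)
    return rule

months_en = alternatives(["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
months_fi = alternatives(["Tammi", "Helmi", "Maalis", "Huhti", "Touko", "Kesä",
                          "Heinä", "Elo", "Syys", "Loka", "Marras", "Joulu"])

# With the upgraded recognize from the next section, this should print
# WRITEOUT (RW): Feb among its output:
#   recognize(months_fi, "Helmi", months_en)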

Code snippet: An actual regex parsing algorithm

It'd be a bit disappointing to present the idea of rewriting from one regular expression to another and not show it in practice.

Here's some code to do the readback and writeout with a regular expression.

First, build_nfa needs to collect some additional structure. Basically we just collect enough info to drive the readback.

def nfa(rule):
    transitions = set()
    new_state = count().__next__
    start = new_state()
    stop = new_state()
    enc = []    # readback plan: which NFA states decide each union/repeat
    build_nfa(start, stop, rule, new_state, transitions.add, enc)
    return start, stop, transitions, new_state(), enc

def build_nfa(start, stop, rule, new_state, new_transition, enc):
    if isinstance(rule, Empty):
        new_transition((start, stop, None))
    elif isinstance(rule, Symbol):
        new_transition((start, stop, rule.sym))
        enc.append(("skip", None))
    elif isinstance(rule, Union):
        start1 = new_state()
        start2 = new_state()
        stop1 = new_state()
        stop2 = new_state()
        new_transition((start, start1, None))
        new_transition((start, start2, None))
        new_transition((stop1, stop, None))
        new_transition((stop2, stop, None))
        enc1 = []
        enc2 = []
        build_nfa(start1, stop1, rule.left, new_state, new_transition, enc1)
        build_nfa(start2, stop2, rule.right, new_state, new_transition, enc2)
        enc.append(("union", (stop1, stop2, enc1, enc2)))
    elif isinstance(rule, Concat):
        mid = new_state()
        build_nfa(start, mid, rule.left, new_state, new_transition, enc)
        build_nfa(mid, stop, rule.right, new_state, new_transition, enc)
    elif isinstance(rule, Repeat):
        start0 = new_state()
        stop0 = new_state()
        new_transition((start, stop, None))
        new_transition((start, start0, None))
        new_transition((stop0, stop, None))
        new_transition((stop0, start0, None))
        enc1 = []
        build_nfa(start0, stop0, rule.seq, new_state, new_transition, enc1)
        enc.append(("repeat", (start, stop0, enc1)))
    else:
        assert False

Here's the actual readback. This can be done directly on the NFA. I have marked the points where ambiguity could be detected, though I don't do anything with that information here.

def readback(enc, states, out):
    for cmd, arg in reversed(enc):
        if cmd == "skip":
            states.pop()
        elif cmd == "union":
            s0,s1,enc0,enc1 = arg
            k0 = (s0 in states[-1])
            k1 = (s1 in states[-1])
            #if k0 and k1:
            #   mark ambiguous
            assert k0 or k1
            if k0:
                out0 = []
                readback(enc0, states, out0)
                out.append((0, out0))
            if k1:
                out1 = []
                readback(enc1, states, out1)
                out.append((1, out1))
        elif cmd == "repeat":
            s0,s1,enc1 = arg
            outs = []
            while True:
                k0 = (s0 in states[-1])
                k1 = (s1 in states[-1])
                #if k0 and k1:
                #   mark ambiguous
                assert k0 or k1
                if k0:
                    break
                else:
                    out1 = []
                    readback(enc1, states, out1)
                    outs.append(out1)
            out.append(outs)
        else:
            assert False

Now this thing gives you results such as [[[], []], []]. To write them back out into a string, take a regex of a similar shape and run writeout on it:

def writeout(cmd, rule):
    if isinstance(rule, Empty):
        return ""
    elif isinstance(rule, Symbol):
        return rule.sym
    elif isinstance(rule, Union):
        side, subcmd = cmd.pop()
        if side == 0:
            return writeout(list(subcmd), rule.left)
        else:
            return writeout(list(subcmd), rule.right)
    elif isinstance(rule, Concat):
        out1 = writeout(cmd, rule.left)
        out2 = writeout(cmd, rule.right)
        return out1 + out2
    elif isinstance(rule, Repeat):
        outs = []
        for subcmd in cmd.pop():
            outs.append(writeout(subcmd, rule.seq))
        return "".join(outs)
    else:
        assert False

The recognizer can be "upgraded" a bit to rewrite strings:

def recognize(rule, string, rewrite_rule=None):
    start, stop, transitions, size, enc = nfa(rule)
    vector = [0] * size
    state = [start]
    step(state, state, transitions, vector, None, 1)
    states = [state]
    for i, sym in enumerate(string, 2):
        #print(sorted(state))
        next_state = []
        step(state, next_state, transitions, vector, sym, i)
        #print("STEP", sorted(next_state))
        step(next_state, next_state, transitions, vector, None, i)
        state = next_state
        states.append(next_state)
    #print(sorted(state))
    if stop in state:
        out = []
        readback(enc, states, out)
        print("READBACK", list(reversed(out)))
        print("WRITEOUT:", writeout(list(out), rule))
        if rewrite_rule is not None:
            print("WRITEOUT (RW):", writeout(list(out), rewrite_rule))
    return (stop in state)

Small sample program:

zeros = Concat(
    Repeat(Concat(Symbol('0'), Symbol('0'))),
    Repeat(Concat(Symbol('0'), Concat(Symbol('0'), Symbol('0')))))
laku = Concat(
    Repeat(Concat(Symbol('l'), Symbol('a'))),
    Repeat(Concat(Symbol('k'), Symbol('u'))))
assert recognize(zeros, "000000000", laku)
assert recognize(laku, "lalalaku", zeros)

And the readout:

READBACK [[[], [], []], [[]]]
WRITEOUT: 000000000
WRITEOUT (RW): lalalaku
READBACK [[[], [], []], [[]]]
WRITEOUT: lalalaku
WRITEOUT (RW): 000000000

I also thought about a compact readback representation where the readback would be just a list of numbers. But on reflection it's not important that the readback is compact. It's more important to be able to refine and rewrite it.

Keyword handling

Keywords such as 'if' and 'else' are usually carved out of the identifiers. This means they are no longer identifiers.

The notion of what is a valid identifier extends far into the rest of the language. If you take it seriously, it will limit what is accepted as an identifier inside any namespace within your language.

It may be worthwhile to take the backtick ` or the backslash \ and have a scheme to display something as an identifier when it isn't a direct identifier. This allows you to write `else` or \else when you need to print a keyword as an identifier.
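
Here's a small sketch of both ideas. The keyword set and the backtick convention are just placeholders:

KEYWORDS = {"if", "else", "while", "return"}

def classify(lexeme):
    # Carve keywords out of the identifiers.
    return ("keyword" if lexeme in KEYWORDS else "identifier", lexeme)

def print_as_identifier(name):
    # Quote anything that isn't a plain identifier, e.g. `else`.
    if name in KEYWORDS or not name.isidentifier():
        return "`" + name + "`"
    return name

assert classify("else") == ("keyword", "else")
assert print_as_identifier("else") == "`else`"
assert print_as_identifier("count") == "count"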

Longest sequences as lexemes

Consider the things you're likely to parse. For example, you might rely on this regex to parse an identifier:

(\w|_)(\d|\w|_)*

It's likely that you want it to parse an identifier as a single lexeme instead of multiple smaller ones. This means that you want it to always match the longest sequence.

You can add a requirement that the longest sequence must also "bottom out", meaning that the DFA rejects the next character that comes in. This is helpful in two ways.

  1. It becomes obvious when you can mark the lexeme complete and feed it to the parser.
  2. It becomes simple to check whether two lexemes require a space in between them: if the DFA does not complete the preceding token without the space, a space is needed (see the sketch below).
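
Here's a sketch of that whitespace check, using Python's re as a stand-in for the lexeme DFA and a hypothetical identifier pattern:

import re

IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def needs_space(a, b):
    # If the first lexeme doesn't "bottom out" at its own end when the two
    # are glued together, a space is required between them.
    m = IDENTIFIER.match(a + b)
    return m is not None and m.end() != len(a)

assert needs_space("foo", "bar")         # "foobar" would fuse into one lexeme
assert not needs_space("foo", "+bar")    # the '+' already stops the identifier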

The approach may also help with repositioning comments in the file.

By the way, if you record the location of each comment (store it in a list) and identify where in the document it appeared (see which parse tree ends up enclosing it), it's quite likely you'll be able to place it back in the correct spot when you print out the modified structure.

For now...

That should be everything you need in order to build a fairly convincing regex engine.

I'm glad I was able to give a short explanation of the parse tree buildup. Hopefully you can see how that can be used for traditional string capture as well.
