Pyllisp grammar (and parser) slow but flexible

I wrote pyllisp its own bytecode format and a grammar language. I have implemented some parts of the language. The results seem good this far, except that it's all a bit too slow.

To introduce a language construct, I need to provide few new instructions, give my compiler some commands that produces those instructions and add few rules into grammar that invoke those commands.

The grammar file consists of rules like this:

term =>
    lookup:     symbol
    int:        int
    float:      float
    string:     string
    pass(expr): lp:"(" expr rp:")"

And the compiler commands look like this:

def post_lookup(env, loc, symbol):
    # XXX: replace with real lookup
    return env.vop(loc, 'gglo',
        Constant(loc, symbol.value))

def post_int(env, loc, num):
    return env.vop(loc, 'cnst',
        Constant(loc, int(num.value)))

def post_float(env, loc, num):
    return env.vop(loc, 'cnst',
        Constant(loc, float(num.value)))

I find this kind of structuring useful because it isolates the actions compiler has to take from the parsing step. Practically it further makes the language implementation independent from the syntax of the language.

My parser is slow

I expected the parsing to be lot slower than the deterministic hand-written parser I had. Turns out it is slow enough to affect the usability of the language. Parsing 1000 lines on my computer takes 2 seconds. Some rules I will still have to introduce will slow down the parsing further.

Profiling it out points out that the time is spent in python's ordereddict.

Since it's a chart parser, progression in parsing requires that the user creates a new chart and fills it with the intermediate parsing results. To parse through 1 million tokens, program has to create 1 million chart cells.

The construction of ordered ditionaries takes lot more time than construction of lists or sets:

>>> import collections, time
>>> def test(n):
...     a = []
...     now = time.time()
...     for i in range(n):
...         a.append(collections.OrderedDict())
...     return time.time() - now
>>> def test2(n):
...     a = []
...     now = time.time()
...     for i in range(n):
...         a.append([])
...     return time.time() - now
>>> test(1000000)
>>> test2(1000000)

Therefore I may get fairly modest performance improvement by switching away from OrderedDict. Into a set and deque, or set and a list.

I may need further performance improvements into the parsing step, but the profiling of my compiler pointed out at least one another possible performance bottleneck outside the parser.

Similar posts