Optimizing RPython JIT for Pyllisp

I spent this weekend optimizing Pyllisp. People on the #pypy channel at irc.freenode.net helped me and taught me things about their JIT.

Benchmarks

You can't optimize anything without benchmarks. Preferably you'd have a genuinely useful program your interpreter is supposed to run, or at least a benchmark whose results you can verify for correctness.

I just decided to take something that would take a long time in Python, so I started building loops around arithmetic, like this:

i = 0
j = 0
k = 0
(while i < 5000000)
    (while j < 5000000)
        (while k < 5000000)
            (assert i+j*k >= 0)
            k += 1
        j += 1
    i += 1

I wrote the same code in Python, of course:

i = 0
j = 0
k = 0
while i < 5000000:
    while j < 5000000:
        while k < 5000000:
            assert i+j*k >= 0
            k += 1
        j += 1
    i += 1

It's not a good benchmark, but it took 2.5 seconds to run on CPython, 0.13 seconds on PyPy and 13 seconds on my interpreter. In retrospect it also demonstrates the advantages of iterators: j and k are never reset, which a for loop over a range would have done by construction.

I also tried it in C. Apparently for C this is no problem at all, because the program exits instantly even without optimization flags:

#include <stdio.h>

int main() {
    long i, j, k;   /* note: i, j and k are never initialized */
    long res;
    while(i < 5000000) {
        while(j < 5000000) {
            while(k < 5000000) {
                res = i+j*k;
                if (res < 0) printf("plop\n");
                k+=1;
            }
            j+=1;
        }
        i+=1;
    }
    return 0;
}

Unfortunately it mostly demonstrates the hazards of brevity: since the counters are never initialized, the loops may not run at all. The slower, corrected variation of this same code takes 40 ms on -O0, 20 ms on -O1 and 2 ms on -O2. At -O2 the compiler likely optimizes the whole thing away, leaving an empty program.

On CPython the above benchmark stresses dictionary lookups and function call dispatch. CPython has those optimized, I'd imagine, about as well as they can be. But why is pyllisp more than five times slower than CPython on this benchmark? From a quick glance at the source I would guess it's because pyllisp interprets a control flow graph held in memory. I also don't know how RPython-compiled C compares to hand-written C.

Enabling JIT

If you've got an interpreter written in RPython, you're a few lines away from having a JIT. How to get there has been explained on the PyPy status blog.
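
For reference, here's roughly what those few lines amount to. This is a minimal sketch assuming a block-dispatching loop; interpret, execute, block and frame are illustrative names, not pyllisp's actual code:

from rpython.rlib import jit

# greens identify a position in the interpreted program,
# reds carry the runtime state that changes on every iteration
jitdriver = jit.JitDriver(greens=['block'], reds=['frame'])

def interpret(block, frame):
    while block is not None:
        jitdriver.jit_merge_point(block=block, frame=frame)
        block = execute(block, frame)   # hypothetical one-step dispatch

# the translation target also has to expose a JIT policy:
def jitpolicy(driver):
    from rpython.jit.codewriter.policy import JitPolicy
    return JitPolicy()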

After compiling with rpython --translation-jit main.py, it goes through the benchmark in 6 seconds. That's twice as fast as where we started. I'd guess it manages to avoid some calls and memory allocations inside the loops.

To help optimize it further, RPython can produce a trace log of how the JIT transforms the code:

PYPYLOG=jit-log-opt:logfile ./main-c benchmarks/addmul_loop

This trace reveals what extra work the JIT compiles in alongside the actual computation going on in the program.

Immutable Fields

The trace log revealed that the JIT was reading fields of the control flow graph, even though the control flow doesn't change while the JIT is running. To communicate this detail about those nodes, we can declare the fields immutable using the _immutable_fields_ hint:

class Call(ValuedOp):
    _immutable_fields_ = ['callee', 'args[*]']
    def __init__(self, callee, args):
        self.callee = callee
        self.args = args[:]

The bracket-and-star notation, as in args[*], means that the field is a list or a dict whose contents never change.

Another thing I noticed is that my interpreter was treating the interface field of my objects as a mutable value, although it's a constant pointing to the object's type information. Using the _attrs_ declaration took care of that problem:

class Integer(Object):
    _immutable_fields_ = ['value']
    _attrs_ = ['value']   # the complete set of attributes this class may have
    def __init__(self, value):
        self.value = value

An additional advantage is that the compiler itself became more aware of what the objects are. It pointed out several places where the program could have crashed.

These optimizations shaved about 2 seconds from the time our interpreter spent in the benchmark.

Promote

jit.promote lets you add a guard on a value. After the guard, the JIT treats the value as a constant in the trace. Knowing that something usually appears as a constant allows the JIT to specialize the trace on that value.
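
For instance, promoting the type of a receiver lets the rest of the trace specialize on that exact type. A sketch with hypothetical names (obj.interface, lookup_method):

from rpython.rlib import jit

def call_method(obj, name, args):
    # guard: from here on the trace assumes this exact interface object
    interface = jit.promote(obj.interface)
    # with interface now constant, this lookup can fold away entirely
    method = interface.lookup_method(name)
    return method.call(obj, args)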

Elidables

The second thing our trace log showed is that we're doing lots of global value lookups. This is similar to what CPython does: it can't go any faster because it's constantly looking up names in dictionaries.

At a glance it seems there's nothing we can do about it, because we have to look those values up in the global module scope; otherwise we couldn't change values inside our modules.

Things change if we introduce a little bit of indirection. Let's say our module dictionary returns a cell holding the value instead of the value itself:

from rpython.rlib import jit

class Cell:
    def __init__(self, slot):
        self.slot = slot

# the namespace dictionary maps names to Cell objects, not plain values
@jit.elidable
def lookup(name):
    return namespace[name]

# at the lookup site:
cell = lookup(name)
return cell.slot

The jit.elidable decorator tells the JIT that the function returns the same value for the same arguments. This allows the JIT to drop the subsequent dictionary lookups and use the cell directly.
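
The indirection also keeps assignment working: writing a global only mutates the slot of the constant cell, so nothing has to be recompiled. A small sketch on top of the lookup above:

def set_global(name, value):
    # the cell's identity is what the JIT constant-folds;
    # its contents remain free to change
    lookup(name).slot = value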

Unroll Safe Functions

jit.unroll_safe tells the JIT that it can unroll the loops inside the function. For example, I recently added this to my interpreter's call dispatch:

@jit.unroll_safe
def do_call(frame, callee, args):
    argv = []
    for arg in args:   # this loop unrolls into the trace
        argv.append(frame.load(arg.i))
    return callee.call(argv)

Now it knows to unroll the loop over my function call arguments.

Here's an example of elidable and unroll_safe working together, in the multimethod implementation: space/multimethod.py

These optimizations cut the benchmark time down to about 0.3 seconds.

Virtualizables

After the optimizations above, there was still extra work on the code path. The JIT kept shuffling values in and out of the array meant for temporary values.

To alleviate this, RPython provides virtualizables. These are objects whose fields the JIT keeps in registers during a trace instead of reading and writing them in memory; the fields are written back only when something outside the trace needs to see the object.

Getting them wrong produces some of the worst errors to debug. But otherwise they're nice to use:

from rpython.rlib import jit
from rpython.rlib.objectmodel import always_inline

class Frame:
    _virtualizable_ = ['tmp[*]']
    def __init__(self, tmp):
        self = jit.hint(self, access_directly=True, fresh_virtualizable=True)
        self.tmp = tmp

    @always_inline
    def store(self, index, value):
        assert index >= 0   # lets the translator prove the index is non-negative
        self.tmp[index] = value

    @always_inline
    def load(self, index):
        assert index >= 0
        return self.tmp[index]

The RPython documentation has a section on virtualizables.
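
One detail worth noting: the hint only takes effect when the JitDriver is also told which red variable is the virtualizable. A sketch, reusing the illustrative driver from the Enabling JIT section:

jitdriver = jit.JitDriver(greens=['block'], reds=['frame'],
                          virtualizables=['frame'])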

This construct brought the benchmark time down to 0.16 seconds. Learning about and applying these things took a few days. That's kind of amazing to me: I had no idea I could get so close to PyPy's performance on this benchmark in such a short time.

Aftermath

The performance gained by observing a single benchmark is somewhat fragile: we only know about the performance inside that one benchmark.

Say that instead of the earlier program you ran:

loop = (func)
    i = 0
    j = 0
    k = 0
    (while i < 5000000)
        (while j < 5000000)
            (while k < 5000000)
                (assert i+j*k >= 0)
                k += 1
            j += 1
        i += 1
(loop)

It takes 1.2 seconds to run this benchmark. If I take a trace, it shows a visit to the variable on every lookup: I haven't optimized my closures.

Once a benchmark produces only millisecond differences, it no longer gives an accurate idea of relative performance. So later I improved the addmul_loop benchmark like this:

j = 0
(while j < 500)
    k = 0
    (while k < 5000000)
        (assert k+j*k >= 0)
        k += 1
    j += 1

This took pyllisp 18 seconds. Evaluating the equivalent program takes PyPy 15 seconds and CPython 8 minutes.

Can you do better than PyPy? Yes, if you study your trace logs! But be aware that an annotation in the wrong place can destroy performance, and an optimization may introduce too much bureaucracy to be worth it. Each translation takes several minutes before you can test anything, so all optimizations need to be evidence-driven:

I observed that I had quite a few guards on the code path. The user might change the meaning of the + operation rather than just overriding it. The thing is, you could change it, although doing so wouldn't be common. For that reason there's a guard watching the + slot in case someone has changed the implementation.

I implemented the following behavior: if the user accesses an undefined slot in her module, a frozen slot is lifted from the globals. She can't change the value after it has been lifted, but she can set it before accessing it.
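
Here's a sketch of how that lifting could look, building on the Cell class from earlier. The frozen flag, module.cells, builtins and FrozenSlot are hypothetical names, not pyllisp's actual ones:

class FrozenSlot(Exception):
    pass

class Cell:
    def __init__(self, slot, frozen=False):
        self.slot = slot
        self.frozen = frozen

def module_getattr(module, name):
    try:
        cell = module.cells[name]
    except KeyError:
        # lift a frozen copy of the global binding into the module
        cell = Cell(builtins[name].slot, frozen=True)
        module.cells[name] = cell
    return cell.slot

def module_setattr(module, name, value):
    if name not in module.cells:
        module.cells[name] = Cell(value)   # setting before access is fine
    elif module.cells[name].frozen:
        raise FrozenSlot(name)             # the slot was already lifted
    else:
        module.cells[name].slot = value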

Many builtin modules are 'frozen', so the user can't change them. That allows access to them to be inlined without inserting a guard. The difference in performance? The improved benchmark now runs in 13 seconds. So there you have it.
