A dynamically typed register allocator
Earlier this month I presented an assembler written in Lever. Now I have a register allocator for it. I implemented iterated register coalescing, and I am going to use it for allocating GPRs, SSE registers, and spilled variables of the same size and type.
lib/backend.lc implements that algorithm, and it reads very much like the pseudocode in the paper, except that all the details about how it translates into practice are present.
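To give an idea of the shape of the algorithm, here is a toy Python sketch of the graph-coloring core that iterated register coalescing builds on: conservative Briggs coalescing, followed by simplify and select. This is only an illustration of the idea, not the code in lib/backend.lc, and it leaves out the freeze step and the iteration after spilling:

def coalesce(adj, moves, K):
    # Merge move-related nodes a, b when the merged node would have
    # fewer than K neighbors of significant degree (Briggs' test).
    for a, b in moves:
        if a != b and a in adj and b in adj and b not in adj[a]:
            merged = (adj[a] | adj[b]) - {a, b}
            if sum(1 for n in merged if len(adj[n]) >= K) < K:
                for n in adj.pop(b):
                    adj[n].discard(b)
                    adj[n].add(a)
                    adj[a].add(n)

def color(adj, K):
    stack = []
    work = {n: set(adj[n]) for n in adj}
    while work:
        # Simplify: remove the lowest-degree node next; removing a
        # node of degree >= K this way marks it a potential spill.
        n = min(work, key=lambda v: len(work[v]))
        for m in work.pop(n):
            work[m].discard(n)
        stack.append(n)
    colors = {}
    spilled = []
    while stack:
        n = stack.pop()
        used = set(colors[m] for m in adj[n] if m in colors)
        free = [c for c in range(K) if c not in used]
        if free:
            colors[n] = free[0]
        else:
            spilled.append(n)  # caller rewrites the program and retries
    return colors, spilled

In the full algorithm these phases feed worklists and repeat until nothing changes anymore, which is where the "iterated" in the name comes from.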
Motivation
Register allocation is the first step in making an optimized assembly program more portable and more convenient to produce.
With register allocation we no longer have to assign a specific register to every variable. Dropping this detail alone makes assembly programming much more tolerable.
Portability is improved because register allocation lets us hide the details of the calling conventions. Register allocation is required so that we can run the same code on Windows and Linux without having to come up with our own ABI.
Benchmark
Because there's a practical reason why I am doing this, it is interesting to know whether we gain anything by compiling our code. If the JIT ran as fast as the compiled code, there would be no point in having a compile step. Unfortunately, this is not the case.
I wrote a benchmark program that plots 20000 non-overlapping circles into an array. Roughly 900 randomly generated circles are discarded. My assembly optimization attempts focused on the inner loop.
The interpreted version under the JIT takes 12 seconds to complete. The assembly version runs in 0.7 seconds, and it takes only 0.9 seconds even if we do not coalesce the moves. This means there can be as much as a 10x speed difference between JIT code and optimized assembly. That sounds like a lot, but the difference between Python and C used to be much larger.
Below is the ndisasm -b64 output, side by side, from the loop that has move instructions coalesced and from the one that doesn't.
coalesced                                     not coalesced
00000000 4D31D2      xor r10,r10              00000000 4D31D2      xor r10,r10
00000003 4939F2      cmp r10,rsi              00000003 4939F2      cmp r10,rsi
00000006 7D35        jnl 0x3d                 00000006 7D41        jnl 0x49
00000008 4D6BCA18    imul r9,r10,byte +0x18   00000008 4D6BCA18    imul r9,r10,byte +0x18
0000000C 4E8B1C0F    mov r11,[rdi+r9]         0000000C 4E8B1C0F    mov r11,[rdi+r9]
00000010 4A8B440F08  mov rax,[rdi+r9+0x8]     00000010 4A8B440F08  mov rax,[rdi+r9+0x8]
00000015 4E8B4C0F10  mov r9,[rdi+r9+0x10]     00000015 4E8B4C0F10  mov r9,[rdi+r9+0x10]
0000001A 4C2BDA      sub r11,rdx              0000001A 4C2BDA      sub r11,rdx
0000001D 482BC1      sub rax,rcx              0000001D 482BC1      sub rax,rcx
00000020 4D03C8      add r9,r8                00000020 4D03C8      add r9,r8
                                              00000023 4D8BDB      mov r11,r11
00000023 4D0FAFDB    imul r11,r11             00000026 4D0FAFDB    imul r11,r11
                                              0000002A 488BC0      mov rax,rax
00000027 480FAFC0    imul rax,rax             0000002D 480FAFC0    imul rax,rax
                                              00000031 4D8BC9      mov r9,r9
0000002B 4D0FAFC9    imul r9,r9               00000034 4D0FAFC9    imul r9,r9
                                              00000038 4D8BDB      mov r11,r11
0000002F 4C03D8      add r11,rax              0000003B 4C03D8      add r11,rax
00000032 4D39CB      cmp r11,r9               0000003E 4D39CB      cmp r11,r9
00000035 7E0A        jng 0x41                 00000041 7E0A        jng 0x4d
00000037 4983C201    add r10,byte +0x1        00000043 4983C201    add r10,byte +0x1
0000003B EBC6        jmp short 0x3            00000047 EBBA        jmp short 0x3
0000003D 4831C0      xor rax,rax              00000049 4831C0      xor rax,rax
00000040 C3          ret                      0000004C C3          ret
00000041 4831C0      xor rax,rax              0000004D 4831C0      xor rax,rax
00000044 B001        mov al,0x1               00000050 B001        mov al,0x1
00000046 C3          ret                      00000052 C3          ret
In any case, we can conclude that register coalescing works and that the effects it produces do matter.
Elements of the backend
When attempting to optimize Lever's PNG code two weeks ago, I made an assembler that uses an instruction dataset to encode instructions. It is exposed as a function that encodes an instruction and returns a sequence of bytes, with roughly the following call:
output = asm.encode_ins(uid, [operand1, operand2...])
Blocks and Branches
Previously, to calculate a relative jump I had to compute the relative offset myself from the buffer. I added some simple logic so that the compiler can resolve the branch without my having to define a variable for the offset.
Now I just need to do the following to produce a loop:
jmp = {i32: 886, i8: 887}
jge = {i32: 875, i8: 876} # (SF=OF)
loop = Block()
loop.code = [
    # SUB r64 m64
    Operation(2327, [a, b]),
    Jump(jge, loop, [a, b]),
    Jump(jmp, exit, []),
]
The first argument determines which instructions a Jump can resolve to. Of course, if the offset of the jump is zero, the jump resolves to no code at all. The last argument to a Jump lists the variables it carries. Since the loop depends on a and b, they are required here.
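As a hedged sketch of what that resolution amounts to, in Python; the function and the string keys are made up for illustration, and the real logic lives in the assembler:

def select_jump(jump_uids, offset):
    # Returns (uid, operands) for the encoder, or None for no code.
    if offset == 0:
        return None  # a zero offset resolves to no code at all
    if "i8" in jump_uids and -128 <= offset <= 127:
        return jump_uids["i8"], [offset]   # short form, rel8
    return jump_uids["i32"], [offset]      # near form, rel32

Since picking the short form changes the instruction length, and with it every later offset, this kind of selection has to be repeated until the choices settle.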
Because the variable flow is already determined in the extended basic blocks, I decided not to calculate the variable flow across basic blocks in the backend. I think it's ultimately a good decision, because there are many ways I can obtain this information, and I need it long before ending up at the backend.
The whole function consists of this kind of basic block. It makes the variable flow analysis very straightforward.
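For illustration, per-block liveness then amounts to a single backwards scan. This Python sketch assumes each operation knows its uses and defs, which is exactly what the operand usage data described in the next section provides, and that the Jumps carry the live-out set:

def block_liveness(code, live_out):
    live = set(live_out)
    live_before = []
    for op in reversed(code):
        live -= set(op.defs)  # a definition kills the variable
        live |= set(op.uses)  # a use makes it live above this point
        live_before.append(set(live))
    live_before.reverse()
    return live_before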
Ordinary instructions
Ordinary instructions are represented as Operation objects. If it is not explicitly defined, they retrieve their operand usage from the instruction dataset. There are situations where the operand usage has to be overridden, such as when using XOR to clear a register:
# XOR m64, r64 -- it is important to point out we are doing a clear here.
Operation(5973, [v_i, v_i],
    operand_usages = ["USAGE_WRITE", null])
The XOR would otherwise have "USAGE_READ_WRITE", "USAGE_READ" as its operand usage. The wrong usage would confuse the register allocator unnecessarily.
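As a sketch of why this matters, here is roughly how use and def sets could be derived from the usage strings; the names are illustrative, not taken from lib/backend.lc. With the override above, the xor-clear defines v_i without reading it, so its previous value is not kept alive:

def uses_and_defs(operands, usages):
    uses = set()
    defs = set()
    for operand, usage in zip(operands, usages):
        if usage in ("USAGE_READ", "USAGE_READ_WRITE"):
            uses.add(operand)  # the old value is read
        if usage in ("USAGE_WRITE", "USAGE_READ_WRITE"):
            defs.add(operand)  # a new value is written
    return uses, defs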
Entry
The entry to and exit from the function are annotated like this:
entry.code = [
    Entry([v_points, v_count, v_x, v_y, v_r]),
    Jump(jmp, loop, [v_points, v_count, v_x, v_y, v_r])
]
Likewise, the exit will be annotated with an Exit object. The details of these objects weren't entirely clear when I started off, but the purpose of Entry is to produce the stack frame for spilled variables, provide def points for the argument variables, and create load operations for every argument location, in order to abstract away the details of the calling conventions entirely.
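As a rough Python sketch, assuming the System V AMD64 calling convention: the register list below is the real one, but everything else is a placeholder for how the lowering of Entry could look:

SYSV_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

def lower_entry(arg_variables):
    code = []
    for var, reg in zip(arg_variables, SYSV_ARG_REGS):
        # Each argument gets a def point: a move from its fixed
        # argument register into a freely allocatable variable.
        code.append(("mov", var, reg))
    # Arguments beyond the sixth would be loads from the stack frame.
    return code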
Interface to the register allocator
irc = IRC(register_class_options)
irc.add_move(dst, src)
irc.add_edge(var1, var2)
spilled = irc.work()
assert spilled.length == 0, "rewrite the program to spill the variables"
We pass Variables and Registers into this system. Every register class, including the spilled variables, has its own allocator session. The IRC could be improved to only recalculate the problem from halfway through during the rewrite of the program, but I will consider that only later, because I am going to run the instruction selector again whenever I receive spills. That is a rational action on x86, because a load from a spilled register can be embedded into an instruction so easily, which makes repeated spills extremely unlikely to happen.
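To show how the liveness information meets this interface, here is a hedged Python sketch of building one session's interference graph. The add_move and add_edge calls are the interface above; the attribute names on blocks and operations are made up:

def build_graph(irc, blocks):
    for block in blocks:
        live = set(block.live_out)
        for op in reversed(block.code):
            if op.is_move:
                irc.add_move(op.dst, op.src)
                # A move alone should not make dst and src interfere,
                # otherwise they could never be coalesced.
                live.discard(op.src)
            for d in op.defs:
                for v in live:
                    if v != d:
                        irc.add_edge(d, v)
            live -= set(op.defs)
            live |= set(op.uses)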
Flaws in the design
There is one obvious design flaw in this backend right now. The problem originates from how my assembler introduces the type size along with the register. Type information such as the size should be introduced much earlier than at the backend. After that, the type should be associated with a register class by a mapping.
The individual register allocators should not receive Registers. Instead they should get Precolored objects to represent an argument that is located somewhere at function entry.
Another problem is that I've not considered how the translation occurs in a 32-bit environment and on Windows. Then again, the address space that 32-bit Windows can provide is starting to be too thin anyway. It may be more productive to focus on bringing PyPy to 64-bit Windows.
I have to correct these flaws before I can continue upwards from here. All of them should be relatively straightforward to solve.
The type information can be packed into the module of the extended basic blocks. We may select the same basic types as there are in other IR formats: f32, f64, i8-i128. Then we can have a vector type Vec(t, c) for SIMD.
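The mapping itself can stay almost trivially small. One possible shape in Python, with made-up class names:

REGISTER_CLASS = {
    "f32": "sse", "f64": "sse",
    "i8": "gpr", "i16": "gpr", "i32": "gpr", "i64": "gpr",
}

def register_class(type):
    if type.startswith("Vec("):
        return "sse"  # SIMD vectors go into the vector registers
    return REGISTER_CLASS[type]  # something like i128 still needs its own decision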
The change from registers to Precolored objects shouldn't be too difficult either. It's required for the full handling of spilled registers anyway.
I also have to look towards the SPIR-V and WebAssembly targets in a few days. Both take inspiration from LLVM, so our simple typing should turn out to be very useful as we go further onwards.