A dynamically typed register allocator
Earlier this month I presented an assembler written in Lever. Now I have a register allocator for it. I implemented iterated register coalescing, and I am going to use it for allocating GPRs, SSE registers, and spilled variables of the same size and type.
lib/backend.lc implements that algorithm, and it reads very much like the pseudocode in the paper, except that all the details about how it translates into practice are present.
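To give an idea of the shape of the algorithm, here is a toy Python sketch of the graph-coloring core that iterated register coalescing builds on: conservative Briggs coalescing, followed by simplify and select. This is only an illustration of the idea, not the code in lib/backend.lc, and it leaves out the freeze step and the iteration after spilling:

def coalesce(adj, moves, K):
    # Merge move-related nodes a, b when the merged node would have
    # fewer than K neighbors of significant degree (Briggs' test).
    for a, b in moves:
        if a != b and a in adj and b in adj and b not in adj[a]:
            merged = (adj[a] | adj[b]) - {a, b}
            if sum(1 for n in merged if len(adj[n]) >= K) < K:
                for n in adj.pop(b):
                    adj[n].discard(b)
                    adj[n].add(a)
                    adj[a].add(n)

def color(adj, K):
    stack = []
    work = {n: set(adj[n]) for n in adj}
    while work:
        # Simplify: remove the lowest-degree node next; removing a
        # node of degree >= K this way marks it a potential spill.
        n = min(work, key=lambda v: len(work[v]))
        for m in work.pop(n):
            work[m].discard(n)
        stack.append(n)
    colors = {}
    spilled = []
    while stack:
        n = stack.pop()
        used = set(colors[m] for m in adj[n] if m in colors)
        free = [c for c in range(K) if c not in used]
        if free:
            colors[n] = free[0]
        else:
            spilled.append(n)  # caller rewrites the program and retries
    return colors, spilled

In the full algorithm these phases feed worklists and repeat until nothing changes anymore, which is where the "iterated" in the name comes from.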
Motivation
Register allocation is the first step in making an optimized assembly program more portable and more convenient to produce.
With register allocation we no longer have to assign a specific register to every variable. Dropping this detail alone makes assembly programming much more tolerable.
Portability is improved because register allocation lets us hide the details of the calling conventions. Register allocation is required so that we can run the same code on Windows and Linux without having to come up with our own ABI.
Benchmark
Because there's a practical reason why I am doing this, it is interesting to know whether we gain anything by compiling our code. If the JIT ran as fast as the compiled code, there would be no point in having a compile step. Unfortunately, this is not the case.
I wrote a benchmark program that plots 20000 non-overlapping circles into an array. Roughly 900 randomly generated circles are discarded. My assembly optimization attempts focused on the inner loop.
The interpreted version under the JIT takes 12 seconds to complete. The assembly version runs in 0.7 seconds, and it takes only 0.9 seconds even if we do not coalesce the moves. This means there can be as much as a 10x speed difference between JIT code and optimized assembly. That sounds like a lot, but the difference between Python and C used to be much larger.
Below is the ndisasm -b64 output, side by side, from the loop that has move instructions coalesced and from the one that doesn't.
coalesced                                     not coalesced
00000000 4D31D2      xor r10,r10              00000000 4D31D2      xor r10,r10
00000003 4939F2      cmp r10,rsi              00000003 4939F2      cmp r10,rsi
00000006 7D35        jnl 0x3d                 00000006 7D41        jnl 0x49
00000008 4D6BCA18    imul r9,r10,byte +0x18   00000008 4D6BCA18    imul r9,r10,byte +0x18
0000000C 4E8B1C0F    mov r11,[rdi+r9]         0000000C 4E8B1C0F    mov r11,[rdi+r9]
00000010 4A8B440F08  mov rax,[rdi+r9+0x8]     00000010 4A8B440F08  mov rax,[rdi+r9+0x8]
00000015 4E8B4C0F10  mov r9,[rdi+r9+0x10]     00000015 4E8B4C0F10  mov r9,[rdi+r9+0x10]
0000001A 4C2BDA      sub r11,rdx              0000001A 4C2BDA      sub r11,rdx
0000001D 482BC1      sub rax,rcx              0000001D 482BC1      sub rax,rcx
00000020 4D03C8      add r9,r8                00000020 4D03C8      add r9,r8
                                              00000023 4D8BDB      mov r11,r11
00000023 4D0FAFDB    imul r11,r11             00000026 4D0FAFDB    imul r11,r11
                                              0000002A 488BC0      mov rax,rax
00000027 480FAFC0    imul rax,rax             0000002D 480FAFC0    imul rax,rax
                                              00000031 4D8BC9      mov r9,r9
0000002B 4D0FAFC9    imul r9,r9               00000034 4D0FAFC9    imul r9,r9
                                              00000038 4D8BDB      mov r11,r11
0000002F 4C03D8      add r11,rax              0000003B 4C03D8      add r11,rax
00000032 4D39CB      cmp r11,r9               0000003E 4D39CB      cmp r11,r9
00000035 7E0A        jng 0x41                 00000041 7E0A        jng 0x4d
00000037 4983C201    add r10,byte +0x1        00000043 4983C201    add r10,byte +0x1
0000003B EBC6        jmp short 0x3            00000047 EBBA        jmp short 0x3
0000003D 4831C0      xor rax,rax              00000049 4831C0      xor rax,rax
00000040 C3          ret                      0000004C C3          ret
00000041 4831C0      xor rax,rax              0000004D 4831C0      xor rax,rax
00000044 B001        mov al,0x1               00000050 B001        mov al,0x1
00000046 C3          ret                      00000052 C3          ret
In any case, we can conclude that register coalescing works and that the effects it produces do matter.
Elements of the backend
When attempting to optimize Lever's PNG code two weeks ago, I made an assembler that uses an instruction dataset to encode instructions. It is exposed as a function that encodes an instruction and returns a sequence of bytes, with roughly the following call:
output = asm.encode_ins(uid, [operand1, operand2...])
Blocks and Branches
Previously, to calculate a relative jump I had to compute the relative offset myself from the buffer. I added some simple logic so that the compiler can resolve the branch without my having to define a variable for the offset.
Now I just need to do the following to produce a loop:
jmp = {i32: 886, i8: 887}
jge = {i32: 875, i8: 876} # (SF=OF)
loop = Block()
loop.code = [
    # SUB r64 m64
    Operation(2327, [a, b]),
    Jump(jge, loop, [a, b]),
    Jump(jmp, exit, []),
]
The first argument determines which instructions a Jump can resolve to. Of course, if the offset of the jump is zero, the jump resolves to no code at all. The last argument to a Jump lists the variables it carries. Since the loop depends on a and b, they are required here.
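As a hedged sketch of what that resolution amounts to, in Python; the function and the string keys are made up for illustration, and the real logic lives in the assembler:

def select_jump(jump_uids, offset):
    # Returns (uid, operands) for the encoder, or None for no code.
    if offset == 0:
        return None  # a zero offset resolves to no code at all
    if "i8" in jump_uids and -128 <= offset <= 127:
        return jump_uids["i8"], [offset]   # short form, rel8
    return jump_uids["i32"], [offset]      # near form, rel32

Since picking the short form changes the instruction length, and with it every later offset, this kind of selection has to be repeated until the choices settle.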
Because the variable flow is already determined in the extended basic blocks, I decided not to calculate the variable flow across basic blocks in the backend. I think it's ultimately a good decision, because there are many ways I can obtain this information, and I need it long before ending up at the backend.
The whole function consists of this kind of basic block. It makes the variable flow analysis very straightforward.
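For illustration, per-block liveness then amounts to a single backwards scan. This Python sketch assumes each operation knows its uses and defs, which is exactly what the operand usage data described in the next section provides, and that the Jumps carry the live-out set:

def block_liveness(code, live_out):
    live = set(live_out)
    live_before = []
    for op in reversed(code):
        live -= set(op.defs)  # a definition kills the variable
        live |= set(op.uses)  # a use makes it live above this point
        live_before.append(set(live))
    live_before.reverse()
    return live_before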
Ordinary instructions
Ordinary instructions are represented as Operation objects. If it is not explicitly defined, they retrieve their operand usage from the instruction dataset. There are situations where the operand usage has to be overridden, such as when using XOR to clear a register:
# XOR m64, r64 -- it is important to point out we are doing a clear here.
Operation(5973, [v_i, v_i],
    operand_usages = ["USAGE_WRITE", null])
The XOR would otherwise have "USAGE_READ_WRITE", "USAGE_READ" as its operand usage. The wrong usage would confuse the register allocator unnecessarily.
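As a sketch of why this matters, here is roughly how use and def sets could be derived from the usage strings; the names are illustrative, not taken from lib/backend.lc. With the override above, the xor-clear defines v_i without reading it, so its previous value is not kept alive:

def uses_and_defs(operands, usages):
    uses = set()
    defs = set()
    for operand, usage in zip(operands, usages):
        if usage in ("USAGE_READ", "USAGE_READ_WRITE"):
            uses.add(operand)  # the old value is read
        if usage in ("USAGE_WRITE", "USAGE_READ_WRITE"):
            defs.add(operand)  # a new value is written
    return uses, defs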
Entry
The entry to and exit from the function are annotated like this:
entry.code = [
    Entry([v_points, v_count, v_x, v_y, v_r]),
    Jump(jmp, loop, [v_points, v_count, v_x, v_y, v_r])
]
Likewise, the exit will be annotated with an Exit object. The details of these objects weren't entirely clear when I started off, but the purpose of Entry is to produce the stack frame for spilled variables, provide def points for the argument variables, and create load operations for every argument location, in order to abstract away the details of the calling conventions entirely.
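As a rough Python sketch, assuming the System V AMD64 calling convention: the register list below is the real one, but everything else is a placeholder for how the lowering of Entry could look:

SYSV_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

def lower_entry(arg_variables):
    code = []
    for var, reg in zip(arg_variables, SYSV_ARG_REGS):
        # Each argument gets a def point: a move from its fixed
        # argument register into a freely allocatable variable.
        code.append(("mov", var, reg))
    # Arguments beyond the sixth would be loads from the stack frame.
    return code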
Interface to the register allocator
irc = IRC(register_class_options)
irc.add_move(dst, src)
irc.add_edge(var1, var2)
spilled = irc.work()
assert spilled.length == 0, "rewrite the program to spill the variables"
We pass Variables and Registers into this system. Every register class, including the spilled variables, has its own allocator session. The IRC could be improved to only recalculate the problem from halfway through during the rewrite of the program, but I will consider that only later, because I am going to run the instruction selector again whenever I receive spills. That is a rational action on x86, because a load from a spilled register can be embedded into an instruction so easily, which makes repeated spills extremely unlikely to happen.
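To show how the liveness information meets this interface, here is a hedged Python sketch of building one session's interference graph. The add_move and add_edge calls are the interface above; the attribute names on blocks and operations are made up:

def build_graph(irc, blocks):
    for block in blocks:
        live = set(block.live_out)
        for op in reversed(block.code):
            if op.is_move:
                irc.add_move(op.dst, op.src)
                # A move alone should not make dst and src interfere,
                # otherwise they could never be coalesced.
                live.discard(op.src)
            for d in op.defs:
                for v in live:
                    if v != d:
                        irc.add_edge(d, v)
            live -= set(op.defs)
            live |= set(op.uses)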
Flaws in the design
There is one obvious design flaw in this backend right now. The problem originates from how my assembler introduces the type size along with the register. Type information such as the size should be introduced much earlier than at the backend. After that, the type should be associated with a register class by a mapping.
The individual register allocators should not receive Registers. Instead they should get Precolored objects to represent an argument that is located somewhere at function entry.
Another problem is that I've not considered how the translation occurs in a 32-bit environment and on Windows. Then again, the address space that 32-bit Windows can provide is starting to be too thin anyway. It may be more productive to focus on bringing PyPy to 64-bit Windows.
I have to correct these flaws before I can continue upwards from here. All of them should be relatively straightforward to solve.
The type information can be packed into the module of the extended basic blocks. We may select the same basic types as there are in other IR formats: f32, f64, i8-i128. Then we can have a vector type Vec(t, c) for SIMD.
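The mapping itself can stay almost trivially small. One possible shape in Python, with made-up class names:

REGISTER_CLASS = {
    "f32": "sse", "f64": "sse",
    "i8": "gpr", "i16": "gpr", "i32": "gpr", "i64": "gpr",
}

def register_class(type):
    if type.startswith("Vec("):
        return "sse"  # SIMD vectors go into the vector registers
    return REGISTER_CLASS[type]  # something like i128 still needs its own decision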
The change from registers to Precolored objects shouldn't be too difficult either. It's required for the full handling of spilled registers anyway.
I also have to look towards the SPIR-V and WebAssembly targets in a few days. Both take inspiration from LLVM, so our simple typing should turn out to be very useful as we go further onwards.