Lever's translator
I had analysed MLsub before, but it turned out to be much more exciting than I thought it to be. And I already thought it was quite amazing.
There are two papers I want to introduce to you:
- "Simple and Efficient Construction of Static Single Assignment Form".
- "Algebraic Subtyping", by Stephen Dolan.
The first paper describes an easy-to-understand method for converting bytecode or an AST into static single assignment (SSA) form. SSA form simplifies many practical compiler optimizations.
The second paper describes an algebra for subtyping and supplies algorithms to implement operations over this algebra using a finite state machine to represent types.
Together these two papers provide a basis for optimizing Lever programs to the point that their performance not only matches but can even surpass that of programs compiled from the C family of programming languages.
The optimization occurs by partial evaluation of the input program.
Theory
A type is a value that describes how a value in a program may behave and in which ways it can be represented. The algebraic subtyping module can deduce facts about types flowing through the program.
SSA form represents the program such that every variable is assigned exactly once. This makes the type flow information provided by the subtyping module very precise.
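As a sketch of what the single-assignment property means, here is a minimal Python illustration; the statement format and the helper are invented for this example, and real SSA construction also handles control flow with ф-nodes:

```python
# A minimal sketch of SSA renaming: each reassignment of a variable
# becomes a fresh version, so every name is assigned exactly once.
def to_ssa(stmts):
    # stmts: list of (target, expr) pairs where expr is a list of
    # variable names and constants; straight-line code only.
    counter = {}   # how many versions each variable has
    current = {}   # latest version of each variable
    out = []
    for target, expr in stmts:
        # Rewrite uses to refer to the latest version of each variable.
        new_expr = [current.get(tok, tok) for tok in expr]
        # Allocate a fresh version for the assigned variable.
        n = counter.get(target, 0)
        counter[target] = n + 1
        versioned = "%s%d" % (target, n)
        current[target] = versioned
        out.append((versioned, new_expr))
    return out

program = [("a", ["1"]), ("a", ["a", "+", "2"]), ("b", ["a", "*", "a"])]
print(to_ssa(program))
# → [('a0', ['1']), ('a1', ['a0', '+', '2']), ('b0', ['a1', '*', 'a1'])]
```

Because each definition gets a unique name, every use points at exactly one assignment, which is what makes the type flow precise.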
Understanding what we can do with this information requires that you understand dynamic typing really well. In a dynamically typed program, every object at runtime carries a type tag.
For example, addition in Lever is not a single function; it is a multimethod. The + operation in Lever checks the following dispatch table:
[int, int]: int_add
[float, float]: float_add
If a matching function signature is not found in the dispatch table, the multimethod calls a default behavior. The default behavior of the + operation invokes the coerce() multimethod, which looks up in another table how the operands could be implicitly coerced to be compatible with each other.
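A rough Python sketch of this dispatch scheme; the function bodies and the coercion table are made up for illustration, and Lever's actual runtime structures differ:

```python
# Multimethod dispatch with a coercion fallback, sketched in Python.
def int_add(a, b): return a + b
def float_add(a, b): return a + b

dispatch = {(int, int): int_add, (float, float): float_add}
coerce_table = {(int, float): lambda a, b: (float(a), b),
                (float, int): lambda a, b: (a, float(b))}

def plus(a, b):
    sig = (type(a), type(b))
    fn = dispatch.get(sig)
    if fn is None:
        # Default behavior: try to coerce the operands so that they
        # match an existing entry in the dispatch table.
        coerce = coerce_table.get(sig)
        if coerce is None:
            raise TypeError("no method for +: %r" % (sig,))
        a, b = coerce(a, b)
        fn = dispatch[(type(a), type(b))]
    return fn(a, b)

print(plus(1, 2))    # dispatches directly to int_add → 3
print(plus(1, 2.5))  # coerces the int to float, then float_add → 3.5
```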
Despite the late binding, Lever programs are easy to understand because of a gentleman's agreement: the meaning of any expression should stay constant. Do not implement + for your type if it isn't addition. Do not name your method .add if it doesn't do something similar to set.add.
If we know that a call to + will receive only integers, we can replace the + with int_add and inline it, because it is a primitive operation. Such a replacement lets us know the result of the +, so one partial evaluation leads to another. We can repeat this over the program until no further partial evaluations occur.
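This replace-until-fixpoint loop can be sketched in Python roughly as follows; the instruction format, the specialization table, and the primitive names are invented for illustration:

```python
# Repeatedly replace multimethod calls whose argument types are known
# with primitive specializations, until nothing changes (a fixpoint).
def specialize(instrs, types):
    table = {("+", "int", "int"): "int_add",
             ("*", "int", "int"): "int_mul",
             ("<", "int", "int"): "int_lt"}
    result_types = {"int_add": "int", "int_mul": "int", "int_lt": "bool"}
    changed = True
    while changed:
        changed = False
        for i, (dst, op, a, b) in enumerate(instrs):
            key = (op, types.get(a), types.get(b))
            if key in table:
                prim = table[key]
                instrs[i] = (dst, prim, a, b)
                # Knowing the result type of one replacement can enable
                # further replacements on the next pass.
                if types.get(dst) != result_types[prim]:
                    types[dst] = result_types[prim]
                    changed = True
    return instrs

code = [("v0", "<", "a", "b"),
        ("v1", "*", "a", "c"),
        ("v2", "+", "v1", "b")]   # v2 depends on v1's inferred type
print(specialize(code, {"a": "int", "b": "int", "c": "int"}))
```

Note how specializing v1 to int_mul is what makes v2 eligible for replacement, which is exactly the "one partial evaluation leads to another" effect.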
Example
How this all fits together is best explained through some examples. Here's the very first function that I have managed to translate:
first_function = (a, b):
while a < b
a *= 2
return a + b
Note that it is important for any translation of a program to faithfully follow the semantics of the original program when it is practical to do so, because that allows you to choose a different evaluation strategy later if necessary.
The module containing this function is first loaded into Lever's runtime. You can call first_function(2, 10) and get back 26.
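For reference, the same function behaves like this ordinary Python function, assuming plain integer semantics:

```python
# The example function rewritten in Python, to check the result by hand.
def first_function(a, b):
    while a < b:
        a *= 2
    return a + b

print(first_function(2, 10))  # a: 2 → 4 → 8 → 16; then 16 + 10 = 26
```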
From there, the bytecode and the environment of first_function are extracted using the introspection functionality in the runtime. The bytecode is rewritten into SSA with the algorithm described in the first paper.
The output SSA looks like this:
L0:
jump L1
L1:
v7 = ф (L0 arg0) (L2 v1)
v0 = call <multimethod "<"> v7 arg1
cond v0 L2 L3
L2:
v1 = call <multimethod "*"> v7 2
jump L1
L3:
v2 = call <multimethod "+"> v7 arg1
ret v2
The SSA is stored in a translation unit, and SSA is extracted in this way from every function that the program encounters.
At this point the subtyping annotator is instantiated. It is a passive utility that records type relations and annotates variables with constrained types. The subtyping annotator is based on the second paper.
To prepare for the partial evaluation, every existing subtyping relation in the program is provided to the annotator. This ensures that the variables are correctly constrained for the partial evaluator. The beauty here is that this step doesn't need to be repeated: the newly discovered constraints constrain the variables further, but the previous constraints do not need to be revisited.
The partial evaluator then evaluates the type checks and dispatches away from the program until it is unable to do so.
After the partial evaluation the SSA in the translation unit looks like this:
L0:
jump L1
L1:
v7 = ф (L0 arg0) (L2 v1)
v0 = lt v7 arg1
cond v0 L2 L3
L2:
v1 = mul v7 2
jump L1
L3:
v2 = add v7 arg1
ret v2
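To convince ourselves that the specialized SSA still computes the same function, here is a small hand-written Python interpreter for it; the ф-node picks its value based on which block control arrived from:

```python
# A tiny interpreter for the specialized SSA of first_function.
def run(arg0, arg1):
    prev, block = None, "L0"
    env = {}
    while True:
        if block == "L0":
            prev, block = "L0", "L1"          # jump L1
        elif block == "L1":
            # v7 = ф (L0 arg0) (L2 v1): choose by predecessor block
            env["v7"] = arg0 if prev == "L0" else env["v1"]
            env["v0"] = env["v7"] < arg1      # v0 = lt v7 arg1
            prev, block = "L1", ("L2" if env["v0"] else "L3")
        elif block == "L2":
            env["v1"] = env["v7"] * 2         # v1 = mul v7 2
            prev, block = "L2", "L1"          # jump L1
        elif block == "L3":
            return env["v7"] + arg1           # v2 = add v7 arg1; ret v2

print(run(2, 10))  # → 26, matching first_function(2, 10)
```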
And the subtyping annotator contains the following type information:
arg0 -> [{<Int>}, {<Int>}]
arg1 -> [{<Int>}, {<Int>}]
L0:
L1:
v7 -> [{<Int>}, {}]
v0 -> [{<Bool>}, {}]
L2:
v1 -> [{<Int>}, {}]
L3:
v2 -> [{<Int>}, {}]
The types are described as pairs of bounds. The first (negative) type represents the most generic value that the variable may contain, whereas the second (positive) type represents the least generic value. These bounds approach each other as the variables are constrained further.
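The idea of two-sided bounds can be sketched with a toy lattice in Python; the lattice, the ordering, and the method names here are all invented for illustration, while the real annotator works over type automata as described in the second paper:

```python
# Sketch of two-sided type bounds: a variable carries two bounds in a
# small lattice, and each new constraint narrows the interval between
# them. Here the lattice is just a chain: bottom < Int < Num < top.
ORDER = {"bottom": 0, "Int": 1, "Num": 2, "top": 3}

class Var:
    def __init__(self):
        self.lo, self.hi = "bottom", "top"
    def constrain_below(self, t):
        # A value of type t flows into the variable: raise the lower bound.
        if ORDER[t] > ORDER[self.lo]:
            self.lo = t
    def constrain_above(self, t):
        # The variable is used where t is expected: lower the upper bound.
        if ORDER[t] < ORDER[self.hi]:
            self.hi = t
    def __repr__(self):
        return "[%s, %s]" % (self.lo, self.hi)

v = Var()
v.constrain_below("Int")   # an Int is assigned to v
v.constrain_above("Num")   # v is passed where a Num is expected
print(v)                   # → [Int, Num]: the bounds have moved closer
```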
This is enough to let us feed the code into any traditional compiler backend, such as LLVM. We might also convert it into a SPIR-V module and run our program on a GPU.
Potential
So, how could this surpass the performance of the equivalent C programs? It comes from the leverage.
The language you use to program the GPU and the CPU is exactly the same here, so less effort is required to switch targets. You have more room to decide which system will evaluate your code, because it is easier to change later. Additionally, you get a software fallback almost for free!
The translator is a programmable library within a dynamically typed language. This enables you to implement new source level annotations and transformations that would be less feasible to provide for C.
Purpose
The approach here resembles PyPy's RPython a lot, but the details differ, because the RPython translator is meant to run as one single big unit, whereas Lever's translator is meant to run in multiple small units.
Whereas RPython is meant for translating interpreters into machine code, Lever's translator is supposed to:
- Optimize the hot loops where a JIT+GC cannot hold up.
- Provide access to GPU computing capabilities.
- Translate small programs into WebAssembly and JavaScript.
Lever is still under development, and we seek synergy between the many small parts that make it up. The translator is supposed to be compatible with the other concepts present in Lever, such as automatic differentiation, vector arithmetic, and the C FFI. This means that you can:
- Use the C structs as the building blocks in your translation units.
- Build interfaces between the translation units from C functions and pass the function pointers across to connect them together.
- Generate compiled automatically differentiated functions.
- Have code containing vector arithmetic translated into machine code just like everything else.
Deficits
Not all Lever programs translate. In practice, only a very narrow subset of all possible Lever programs can be translated.
As the translator expands in features, it will be able to translate more programs and target more uses. But many Lever programs will need to be partially rewritten if it turns out they have to be optimized by translation.
Once a program has been rewritten for translation, it may no longer run properly, or fast enough, without the translation. Also, some Lever programs may translate on some kinds of target platforms but not on others.
For example, on GPU targets you cannot use dynamic memory management or garbage collected memory management. Therefore programs requiring such features to function cannot be translated to GPU targets.
As another example, memory management may have to be explicit when targeting microcontrollers that lack the computing power to support garbage collection strategies.
Additionally, there are use cases where using Lever, with or without the translator, does not make sense.
If you have a warehouse full of computers, it may be better to use a language that does extensive whole-program optimization. In that case the following condition holds:
the programmer's time can be surprisingly cheap compared to the computing time, because the computing time compounds heavily.
On average, a dynamically typed Lever program runs 10x slower inside a JIT than a statically typed program. If 50% of your program runs for 5% of the time, and you make that portion 10x slower, the running time of the whole program increases by 45%.
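The arithmetic checks out: if a portion accounts for a fraction p of the running time and becomes k times slower, the total running time scales by (1 - p) + k·p:

```python
# Scaling factor of the whole program's running time when a portion
# taking fraction p of the time becomes k times slower.
def slowdown(p, k):
    return (1 - p) + k * p

print(round(slowdown(0.05, 10), 2))  # → 1.45, i.e. 45% longer
```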
But on the other hand, if your program requires complicated initialization but has small portions that must perform fast, and you can parallelize the code to run on the GPU, then Lever will suit you extremely well.
Most interactive desktop applications have very small hot-loop portions and numerous cold paths. Additionally, these programs can be very complex and require a lot of experimentation.
Performance
The author of the MLsub thesis boasts that this approach to subtyping is very performant. A translator built like this is likely to perform very well, and we should reach interactive translation speeds for many kinds of programs.
If translation takes too long, the results can be memoized into a file and recompiled only when the source changes. There will be a separate utility in Lever to support this kind of workflow.
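A hedged sketch of such a workflow in Python, keying the compiled result by a hash of the source; translate() here is a placeholder, not the actual translator API:

```python
# Memoize translation results on disk, recompiling only when the
# source changes (detected via a content hash).
import hashlib, os, pickle, tempfile

def translate(source):
    # Placeholder standing in for the real, expensive translation.
    return "compiled(%s)" % source

def translate_cached(source, cache_dir):
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".bin")
    if os.path.exists(path):          # source unchanged: reuse the result
        with open(path, "rb") as f:
            return pickle.load(f)
    result = translate(source)        # new or changed source: recompile
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

cache = tempfile.mkdtemp()
print(translate_cached("x + 1", cache))  # compiles and stores
print(translate_cached("x + 1", cache))  # second call hits the cache
```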
Error messaging during translation
If something goes wrong and your program doesn't translate for some reason, the SSA will be annotated with references that allow the generated code to be mapped back to the original source code. These references allow the translator to produce error reports showing where the type constraints were violated, or where they turned out not to constrain the program sufficiently.
Sometimes such error reports fail to directly explain the source of the problem. Fortunately, Lever's source locations point out the precise location in the file for every SSA expression that was generated. This precision can be used to overlay the subtyping annotations onto the source files.
Future usage
It will still take some time before these ideas solidify into a minimum viable product. I will still try to describe a vision of how it will be used, because doing so helps in the design phase.
The translator will likely be invoked by importing the appropriate library and the target, then opening a translation unit. The code to do so will look something like:
import translator
import spirv
import ffi
unit = translator.Unit(spirv.target, {
float_constant_precision = ffi.float
})
The float_constant_precision illustrates how you can adjust what the translator produces.
Next you add the entry points and the seed types.
unit.add("fragment_main", fragment_shader,
fragment_shader_header)
unit.add("vertex_main", vertex_shader,
vertex_shader_header)
Then you run the translator and see it produce the output you configured it to produce.
shader = unit.translate()
Finally you use the results the way you intended to:
pipeline = init_pipeline(
shader.fragment_main, shader.vertex_main)
You get the same results as from optimized C programs, but with far less effort.
There's a limit to this, but it's not in my sight.