How to get high-performance graphics with a dynamic language

The Lever programming language is a dynamic programming language targeted at graphics programmers. Many people would be ready to say the design goals for this language are delusional. Don't listen to those people!

Traditional dynamic languages aren't a good fit if you face soft realtime requirements, such as in games or virtual reality apps. That's only true because you couldn't drop the dynamic runtime once it started to limit performance. We are going to make Lever a good fit for soft realtime apps.

Dynamic programming languages support powerful development techniques; they are easy to work with, and you get clean, small programs. You can use them to completely outrun your competition, as long as your competition doesn't understand dynamic languages and use them too.

Lever is already delivering. Below you see samples/vulkan/sample3.lc running on my Ubuntu 15.10 machine, which has an Intel Core 2 Quad Q9550 CPU running at 2.83 GHz and an Nvidia GTX 960.

[screenshot: 200,000 moving colorful dots rendered by samples/vulkan/sample3.lc]

It displays 200,000 moving colorful dots using a compute shader. The compute shader runs entirely on the GPU, so the interpreter is just filling out a command buffer and submitting it via the Vulkan API every frame. The source code is cleaner and easier to comprehend than the C++ API examples.
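To make the shape of that per-frame work concrete, here is a rough structural sketch in Python. The helpers are hypothetical stand-ins, not Lever code or a real Vulkan binding; their comments name the Vulkan calls that would sit behind them. The point is that the CPU side only records and submits a command buffer, while moving the dots is entirely the compute shader's job.

# A runnable structural sketch (Python, not Lever; no real Vulkan binding).
# The per-frame CPU work is only "record a command buffer and submit it";
# moving the dots happens in the compute shader on the GPU.

def record_compute_commands(cmd_buffer, dot_count):
    # Stands in for vkBeginCommandBuffer, binding the compute pipeline,
    # vkCmdDispatch over the dots and vkEndCommandBuffer.
    cmd_buffer.append(("dispatch", dot_count))

def submit(queue, cmd_buffer):
    # Stands in for vkQueueSubmit; the GPU then runs the commands asynchronously.
    queue.append(list(cmd_buffer))

def frame(queue, dot_count=200_000):
    cmd_buffer = []
    record_compute_commands(cmd_buffer, dot_count)
    submit(queue, cmd_buffer)

queue = []
for _ in range(3):          # three simulated frames
    frame(queue)
print(len(queue), "command buffers submitted")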

[graph0: interpreter milliseconds per frame, JIT enabled]

Here's a graph displaying how many milliseconds our interpreter spends on each frame. The red bar depicts our frametime budget if we intend to render at 90 Hz and keep the Oculus Rift well fed with frames to ensure a satisfying user experience. We've got 100 frames in the window.

Here's the same, but with the interpreter's JIT disabled:

[graph1: interpreter milliseconds per frame, JIT disabled]

The average running time of both benchmarks was about 0.2 ms, so the JIT was probably doing jack squat. From the looks of it, it seems to in fact be taxing our frametime budget rather than preserving it. We come close to missing a frame twice during our frame window.
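The budget itself is simple arithmetic: at 90 Hz each frame gets 1000 / 90 ≈ 11.1 ms. Here is a small Python sketch of how a window of frame times would be checked against that budget; the sample numbers are made up for illustration, not the measurements behind the graphs.

# Frame-time budget check; the frame times below are made-up sample data,
# not the measurements behind the graphs.
BUDGET_MS = 1000.0 / 90.0   # ~11.1 ms per frame at 90 Hz

frame_times_ms = [0.2, 0.19, 0.21, 7.5, 0.2, 0.22, 10.9, 0.2]

near_misses = sum(1 for t in frame_times_ms if t > BUDGET_MS * 0.5)
misses = sum(1 for t in frame_times_ms if t > BUDGET_MS)

print(f"budget per frame: {BUDGET_MS:.1f} ms")
print(f"frames using over half the budget: {near_misses}")
print(f"frames over the budget: {misses}")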

It's already feasible for simple things. If we manage to use only half of our frametime budget, we may be able to hide most of these performance spikes. But what if we really needed all the performance we can get?

I haven't put much effort into optimizing this interpreter; I've saved that work for when an app comes along that really deserves it. I can at least double the performance depicted here. But what if that's not enough either?

Translating Lever

Once I have it written down, I can go all the way down using the same technique that gave me an efficient interpreter from Python source code and made Lever itself fast to develop: I can translate the active part of my renderer directly into native machine code!

Think about it: I've got type annotations to call any C API directly, I have an existing, working example of how to translate dynamically typed programs, and I have written a few compilers myself before. I even have an intermediate step that makes it all useful before it can do what I am proposing.

I took care that the vector arithmetic in Lever matches that of GLSL, to make it easier to translate into SPIR-V. We can make Lever translate shaders before it can compile whole rendering loops.
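To show what "matching GLSL" means in practice: in GLSL, arithmetic between vectors is componentwise, and a scalar broadcasts over all components. Here is a minimal Python sketch of those semantics; it is an illustration, not Lever's actual vector implementation.

# A minimal vec3 with GLSL-style semantics: vector * vector is componentwise,
# vector * scalar broadcasts. Not Lever's real implementation, just the rules
# its vectors are meant to share with GLSL so shaders translate cleanly.
class vec3:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __mul__(self, other):
        if isinstance(other, vec3):   # componentwise, like GLSL
            return vec3(self.x * other.x, self.y * other.y, self.z * other.z)
        return vec3(self.x * other, self.y * other, self.z * other)  # scalar broadcast

    def __add__(self, other):
        return vec3(self.x + other.x, self.y + other.y, self.z + other.z)

    def __repr__(self):
        return f"vec3({self.x}, {self.y}, {self.z})"

velocity = vec3(1.0, 2.0, 0.5)
dt = 0.016
print(vec3(0.0, 0.0, 0.0) + velocity * dt)   # vec3(0.016, 0.032, 0.008)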

I have taken the features from Python that helped in translating it, and I included those features in my language. The plan is to make sure the translation framework is less of a hack than it was in PyPy.

What's a translation framework and what does it let you do? That's easiest to explain with an example. Consider the following program:

factorial = (k):
    if k == 1
        return 1
    else
        return factorial(k-1) * k

Then let's consider that we've got an entry point to enter this factorial from C; it has the type signature i32(i32).
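As a hypothetical illustration of what such an entry point would look like from the outside, suppose the translator emitted a shared library factorial.so exporting a native factorial with that i32(i32) signature. Calling it through Python's ctypes would then look like this; the library name and export are assumptions for illustration, not something that exists today.

import ctypes

# Hypothetical: assume the translator emitted a shared library "factorial.so"
# exporting a symbol "factorial" with the native signature i32(i32).
lib = ctypes.CDLL("./factorial.so")
lib.factorial.argtypes = [ctypes.c_int32]   # one 32-bit integer argument
lib.factorial.restype = ctypes.c_int32      # 32-bit integer return value

print(lib.factorial(5))   # would print 120, given the library above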

The description I am about to give is simplified for presentation purposes. In a real implementation there would be more going on, but those special cases could be presented with similar examples that outline just the special case. If I had spent much more time on this blog post, I could have covered them as well.

Extracting the bytecode

When the translator gets the program above, it has already been compiled into bytecode that the interpreter understands. The bytecode needs to be extracted starting from the entry point:

factorial(k):
    cond byes, k == 1
    return factorial(k-1) * k
byes:
    return 1

The above "bytecode" is extremely compressed for display. The actual bytecode has each line broken into several simpler instructions.
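For instance, a line like return factorial(k-1) * k might break down into something along these lines, written here as Python tuples purely for illustration; the real Lever instruction set differs in its details.

# Hypothetical expansion of "return factorial(k-1) * k" into simpler
# instructions, written as Python tuples for illustration only.
expanded = [
    ("getglobal", "t0", "factorial"),   # t0 = factorial
    ("getconst",  "t1", 1),             # t1 = 1
    ("sub",       "t2", "k", "t1"),     # t2 = k - t1
    ("call",      "t3", "t0", "t2"),    # t3 = t0(t2)
    ("mul",       "t4", "t3", "k"),     # t4 = t3 * k
    ("return",    "t4"),
]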

Programs that cannot be extracted cannot be translated.

Type inference

Next, the program goes through an analysis that annotates it with types. The types are mainly determined by a forward flow analysis:

factorial(i32):
    cond byes, i32 == int
    return {i32(i32)}(i32-int) * i32
byes:
    return int

The forward flow mostly succeeds at annotating everything with the most specific type. There are a few points here where it doesn't manage to determine a more specific type with forward flow alone.

Lever has straightforward multimethod dispatch on operators to support translation. The translator can look up individual methods in those tables and resolve the method for (int, int) ahead of time.
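Here is a toy version of such a forward pass, written in Python for illustration; the instruction format and the method table are simplified stand-ins rather than Lever's real data structures.

# A toy forward type-inference pass over a simplified instruction list.
# Types flow from the known argument type (k: i32) and from constants;
# operators are resolved through a multimethod table keyed by argument types.

methods = {                      # resolved multimethods: (op, argtypes) -> result type
    ("==", ("i32", "int")): "bool",
    ("-",  ("i32", "int")): "i32",
    ("*",  ("i32", "i32")): "i32",
    ("call", ("i32",)): "i32",   # factorial itself: i32(i32)
}

instructions = [
    ("==",   "t0", "k", "one"),   # t0 = k == 1
    ("-",    "t1", "k", "one"),   # t1 = k - 1
    ("call", "t2", "t1"),         # t2 = factorial(t1)
    ("*",    "t3", "t2", "k"),    # t3 = t2 * k
]

types = {"k": "i32", "one": "int"}   # seeded from the entry point and constants

for op, dst, *args in instructions:
    argtypes = tuple(types[a] for a in args)
    types[dst] = methods[(op, argtypes)]     # forward flow: inputs determine output

print(types)
# {'k': 'i32', 'one': 'int', 't0': 'bool', 't1': 'i32', 't2': 'i32', 't3': 'i32'}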

Every function and type that factorial refers to here can be resolved into a low-level representation, which means the type inference stage is complete.

Programs that cannot be type-inferred cannot be translated.

Lowering

We need to obtain something equivalent to C code, so every high-level instruction needs to be replaced by a low-level equivalent. This is possible because our code now carries type annotations. We only need to determine what each operation lowers to:

factorial function    lowers into  i32(i32) native function
==(int:i32, int)      lowers into  i32==i32
-(int:i32, int)       lowers into  i32-i32
*(int:i32, int:i32)   lowers into  i32*i32

Programs that cannot be lowered cannot be translated.
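Continuing the same toy example, the lowering step can be pictured as one more table lookup that swaps each typed operation for its native counterpart. Again, this is an illustrative Python sketch, not the real translator.

# Toy lowering, continuing the sketch above: the factorial function itself
# lowers into an i32(i32) native function, and each typed operator is swapped
# for its native counterpart via one more table lookup.

lowerings = {
    ("==", ("i32", "int")): "i32==i32",
    ("-",  ("i32", "int")): "i32-i32",
    ("*",  ("i32", "i32")): "i32*i32",
}

for (op, argtypes), lowered in lowerings.items():
    print(op, argtypes, "lowers into", lowered)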

After lowering, we can use the same techniques C compilers use to translate it into native machine code. Or we could feed it as input to LLVM, or output it as SPIR-V.

Fin

We shouldn't stick to statically typed languages. They are not a cost-effective way to program. Mandatory type annotations or mandatory type coercions get too much in the way when you're supposed to create something new. Let's get rid of them.

The PyPy translator is able to generate just-in-time compilers from interpreters. I have no plans to stop using PyPy to translate Lever. I would rather let the translation utilities in Lever emphasize their more common use cases.

There's more about creating translators in the PyPy papers. I really approve of this approach to creating interpreters. I think that without them I wouldn't have made it this far here alone.
