GPU computing through SPIR-V
SPIR-V is an intermediate language for GPU shaders and kernels. It is meant to be used with Vulkan and OpenCL.
I'm preparing to use SPIR-V as a compile target, so I have to understand the details of this format fairly well before I proceed.
I used the "Simple Vulkan Compute Example" as a starting point, though due to the information missing from that post, it might have been easier to just read the Vulkan spec instead.
OUTLINE of this post
This post discusses...
- A study of the SPIR-V produced from a simple GLSL compute shader.
- How to disassemble SPIR-V.
- Details about the structure of a SPIR-V file.
- How to run the mentioned compute program with Vulkan.
- An approach to building up SPIR-V binaries from your own IR.
Disclaimer: These are my notes and should not be relied upon.
The SPIR-V program
GPU computing has become much simpler over time, yet it is still quite complicated. For this reason our first program is going to be really simple. Here's the GLSL source code for it:
/* Mandatory for Vulkan GLSL */
#version 450
#extension GL_ARB_separate_shader_objects : enable
#extension GL_ARB_shading_language_420pack : enable

/* Describes a buffer, as an array of unsigned integers */
layout(binding = 0) buffer InOut {
    uint result[];
};

/* Let's store the invocation ID into each index */
void main() {
    result[gl_GlobalInvocationID.x] = gl_GlobalInvocationID.x;
}
If you dispatch this with thousands of workers with increasing invocation IDs, you get the array filled with [0, 1, 2, ...]. Not very interesting or useful, just something that we can observe. If you discard every fourth byte and interpret the rest as an RGB image, the contents of the result buffer can even be viewed as a picture.
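For instance, here's the byte-discarding trick as a short Python sketch (my own illustration, not part of the pipeline):

# A sketch (plain Python) of the "discard every fourth byte" trick: each
# little-endian uint32 in the result buffer becomes one RGB pixel once its
# high byte is dropped.
def as_rgb(data):
    return bytes(b for i, b in enumerate(data) if i % 4 != 3)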
Back to the shader: the GLSL file glsl_blur.comp can be compiled into SPIR-V with the following Linux shell command:
glslangValidator -V glsl_blur.comp -o glsl_blur.comp.spv
glslangValidator has to be installed before you can use it, so development with GLSL isn't as simple as it used to be. But that's just as well; GLSL is a dumb language to start with.
Some tools for manipulating SPIR-V files are available from SPIRV-Tools. Among these you find spirv-dis. Here's the output of spirv-dis glsl_blur.comp.spv:
; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 1
; Bound: 24
; Schema: 0
OpCapability Shader
%1 = OpExtInstImport "GLSL.std.450"
OpMemoryModel Logical GLSL450
OpEntryPoint GLCompute %main "main" %gl_GlobalInvocationID
OpExecutionMode %main LocalSize 1 1 1
OpSource GLSL 450
OpSourceExtension "GL_ARB_separate_shader_objects"
OpSourceExtension "GL_ARB_shading_language_420pack"
OpName %main "main"
OpName %InOut "InOut"
OpMemberName %InOut 0 "result"
OpName %_ ""
OpName %gl_GlobalInvocationID "gl_GlobalInvocationID"
OpDecorate %_runtimearr_uint ArrayStride 4
OpMemberDecorate %InOut 0 Offset 0
OpDecorate %InOut BufferBlock
OpDecorate %_ DescriptorSet 0
OpDecorate %_ Binding 0
OpDecorate %gl_GlobalInvocationID BuiltIn GlobalInvocationId
%void = OpTypeVoid
%3 = OpTypeFunction %void
%uint = OpTypeInt 32 0
%_runtimearr_uint = OpTypeRuntimeArray %uint
%InOut = OpTypeStruct %_runtimearr_uint
%_ptr_Uniform_InOut = OpTypePointer Uniform %InOut
%_ = OpVariable %_ptr_Uniform_InOut Uniform
%int = OpTypeInt 32 1
%int_0 = OpConstant %int 0
%v3uint = OpTypeVector %uint 3
%_ptr_Input_v3uint = OpTypePointer Input %v3uint
%gl_GlobalInvocationID = OpVariable %_ptr_Input_v3uint Input
%uint_0 = OpConstant %uint 0
%_ptr_Input_uint = OpTypePointer Input %uint
%_ptr_Uniform_uint = OpTypePointer Uniform %uint
%main = OpFunction %void None %3
%5 = OpLabel
%18 = OpAccessChain %_ptr_Input_uint %gl_GlobalInvocationID %uint_0
%19 = OpLoad %uint %18
%20 = OpAccessChain %_ptr_Input_uint %gl_GlobalInvocationID %uint_0
%21 = OpLoad %uint %20
%23 = OpAccessChain %_ptr_Uniform_uint %_ %int_0 %19
OpStore %23 %21
OpReturn
OpFunctionEnd
This is the information stored in the SPIR-V file. It is divided into the following sections that have to come in the correct order:
- Capabilities
- Extensions/Imports
- Memory model
- Entry points
- Debugging info
- Type/Variable declarations
- Function declarations
Most of the information above isn't very interesting. The OpSource, OpSourceExtension and Op*Name instructions are only for debugging and could be discarded for compaction.
The most interesting lines for now can be found at the top:
OpMemoryModel Logical GLSL450
OpEntryPoint GLCompute %main "main" %gl_GlobalInvocationID
OpExecutionMode %main LocalSize 1 1 1
There can be only one OpMemoryModel, which describes the addressing and memory model for the whole module. On the other hand, there can be multiple entry points, and many execution mode flags are allowed as well.
The IDs after the visible name in the entry point are input/output variables. Every Input/Output variable that an entry point may use must be identified there.
LocalSize 1 1 1 tells the size of the work groups used by the entry point. I'm not certain of every detail because I haven't used compute modules a lot, but it sounds simple: with a local size of 1x1x1, each work group contains a single invocation, so the dispatch(512 * 512, 1, 1) call we issue later produces 512 * 512 invocations in total.
The actual program could be also written like this:
entry:
    i18 = access_chain gl_GlobalInvocationID.x
    i19 = load i18
    i20 = access_chain gl_GlobalInvocationID.x
    i21 = load i20
    i23 = access_chain InOut.result[i19]
    store i23 i21
    ret

i18 : uint* (Input)
i19 : uint
i20 : uint* (Input)
i21 : uint
i23 : uint* (Uniform)
The above is a sketch of an intermediate language I intend to use for representing most of this. It is still open how this will pan out eventually.
The Vulkan side (in Lever)
There are hoops and loops that we can hardly skip for now. Ideally, using the GPU shouldn't be much harder than using any other computing resource on the computer.
Fortunately it's not too bad. Let's walk through the Lever code that abstracts the task.
The first thing is to obtain access to the GPU. In Lever there's a module for this:
gpu = GPU()
device_only = GPUMemory(gpu, device_flags)
readback = GPUMemory(gpu, readback_flags)
upload = GPUMemory(gpu, upload_flags)
The latter three are memory heap managers. We only use readback here. The distinction between these three heaps doesn't come directly from the Vulkan API, but it's a neat arrangement anyway.
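The flag definitions aren't shown here, but such a three-way split conventionally maps onto Vulkan memory property flags roughly like this (a sketch of an assumption, not Lever's actual definitions):

# An assumption, not Lever's actual definitions: a conventional mapping of
# the three heaps onto Vulkan memory property flags.
device_flags   = ["DEVICE_LOCAL_BIT"]                       # fast, GPU-only memory
readback_flags = ["HOST_VISIBLE_BIT", "HOST_CACHED_BIT"]    # GPU writes, CPU reads back
upload_flags   = ["HOST_VISIBLE_BIT", "HOST_COHERENT_BIT"]  # CPU writes, GPU reads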
Loading the shader program
Loading the shader into the GPU is trivial. Note that if your shader binary is invalid, this will crash the process.
shader_data = spirv.write_buffer(unit)
module = vulkan.ShaderModule(gpu, {
    codeSize = shader_data.length
    pCode = shader_data
})
custom_shader_stage = {
    stage = "COMPUTE_BIT"
    module = module
    name = "main"
}
Data & Descriptor sets
Without any data we can do nothing. Fortunately it's easy to set up a buffer:
size = 512 * 512 * 4
buffer = readback.createBuffer(size, "STORAGE_BUFFER_BIT")
data = buffer.mem.map(ffi.byte)
The buffer needs to inform the system of how it will be used. Note that it's already memory-mapped for reading; that's just for convenience, as there's no reason to map it yet.
The descriptor set layout and pipeline layout describe what kind of layout our shader program expects. This could be derived directly or partially from the shader program, so it is a little bit redundant.
DSC0 = DescriptorSetLayout(gpu, {
    "output_data": {
        binding = 0
        descriptorType = "STORAGE_BUFFER"
        stageFlags = "COMPUTE_BIT"
    }
})

pipeline_layout = vulkan.PipelineLayout(gpu, {
    flags = 0
    setLayouts = [DSC0]
    pushConstantRanges = []})
A lot of things in Vulkan are redundant data that just needs to be there, yet it may still be hard to determine where that data should come from.
Finally a descriptor set is created:
dsc0 = DSC0()
dsc0.output_data = {
    buffer = output_data.buffer
    offset = 0
    range = -1
}
gpu.update(dsc0)
This is used to attach the memory and buffers into the module as it runs.
Pipeline
Here's how a compute pipeline is created:
pipeline = gpu.createComputePipelines(gpu.pipeline_cache, [{
    stage = custom_shader_stage
    layout = pipeline_layout
}])[0]
The compute pipeline is among the simplest pipelines in Vulkan. Now everything's ready and we can construct a command buffer to be submitted to the GPU.
Command buffer
The command buffer is a log of commands to be evaluated. First they are only recorded; later the submit command sends them over to the GPU for evaluation.
Note how the command buffer collects everything that we created and bundles them together.
cbuf = gpu.queue.pool.allocate({
    level = "PRIMARY",
    commandBufferCount = 1})[0]

cbuf.begin({flags = "ONE_TIME_SUBMIT_BIT"})
cbuf.bindPipeline("COMPUTE", pipeline)
cbuf.bindDescriptorSets("COMPUTE", pipeline_layout, 0,
    [dsc0], [])
cbuf.dispatch(512 * 512, 1, 1)
cbuf.end()
That's all. You don't need to worry about implicit state being tucked anywhere because there's none in Vulkan.
Submission & Waiting
Finally, for the commands to take effect, they need to be submitted. This buffer may be submitted only once, because we said so with ONE_TIME_SUBMIT_BIT in the earlier step.
fence = vulkan.Fence(gpu, {})
gpu.queue.submit([{
    waitSemaphores = []
    waitDstStageMask = []
    commandBuffers = [cbuf]
    signalSemaphores = []
}], fence)
A fence is passed to the above command. To make sure we don't read the results before they've been written, we have to wait for the fence before accessing the data.
status = gpu.waitForFences([fence], true, -1)
assert status.SUCCESS, status
write_rgb_image_file("output.png", data)
Cleaning up
Finally, lots of GPU resources are such that you have to release them explicitly. I've made this simple in Lever; all you need to call is gpu.destroy:
gpu.destroy()
In general, this should be sufficient. There may be special cases where additional steps are needed.
Building up SPIR-V binaries
There is a lot involved in building a backend that emits SPIR-V binaries. If you attempt it, I propose you do it in layers.
The very first step you have to take is to become capable of decoding and encoding SPIR-V binaries. This is easy, because the SPIR-V registry provides a machine-readable specification of the format. Using that file you should be able to get it all open.
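For instance, here's a sketch of loading that machine-readable grammar to look up operand roles. I'm assuming the JSON file distributed in the registry/SPIRV-Headers as spirv.core.grammar.json, whose entries carry "opname", "opcode" and "operands" with a "kind" field; treat the specifics as assumptions if your copy differs:

# A sketch (Python) of indexing the machine-readable SPIR-V grammar.
import json

with open("spirv.core.grammar.json") as f:
    grammar = json.load(f)

by_opcode = {ins["opcode"]: ins for ins in grammar["instructions"]}
by_name   = {ins["opname"]: ins for ins in grammar["instructions"]}

# Operand roles for OpLoad: ['IdResultType', 'IdResult', 'IdRef', ...]
print([operand["kind"] for operand in by_name["OpLoad"]["operands"]])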
Note that you don't need to decode every instruction, because every instruction has its length encoded in it. So the spec may be used in several ways. In any case, it helps you get the decoding/encoding right without guesswork or manual labor! Very useful.
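In fact you can split a module into instructions without knowing a single opcode. A minimal sketch in Python (my own throwaway code, little-endian assumed): the high 16 bits of each instruction's first word give its total length in words.

# Split a SPIR-V binary into (opcode, operand words) pairs.
import struct

def instructions(path):
    with open(path, "rb") as f:
        data = f.read()
    words = struct.unpack("<%dI" % (len(data) // 4), data)
    # A magic of 0x03022307 would mean the words need byte-swapping.
    assert words[0] == 0x07230203, "not a little-endian SPIR-V file"
    # Header words: magic, version, generator, bound, schema.
    i = 5
    while i < len(words):
        wordcount = words[i] >> 16
        opcode = words[i] & 0xffff
        assert wordcount > 0, "corrupt instruction"
        yield opcode, words[i+1 : i+wordcount]
        i += wordcount

for opcode, operands in instructions("glsl_blur.comp.spv"):
    print(opcode, operands)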
ID table
Every ID in a SPIR-V file is a reference to an object. When generating the file you have to keep track of which object corresponds to which ID, and you have to assign new IDs as you go.
For this purpose it's a good idea to have an ID table. You retrieve IDs from this table, and when your file is finished, you take one past the highest assigned ID and store it in the bound field of the SPIR-V header.
Also, if you prefer, add some checking that you get the roles IdResult, IdResultType and IdRef correct:

- An instruction containing an IdResult is thought to create an object.
- An IdResultType is the type ID for the result created in that instruction.
- An IdRef is an ordinary reference.
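As a sketch, an ID table can be as small as this (Python; the class and method names are mine, mirroring the description above, not an existing API):

# A minimal ID table for SPIR-V generation.
class IdTable:
    def __init__(self):
        self.ids = {}
        self.next_id = 1              # SPIR-V IDs start from 1

    def ref(self, obj):               # IdRef: look up an ID, assigning on demand
        if obj not in self.ids:
            self.ids[obj] = self.next_id
            self.next_id += 1
        return self.ids[obj]

    result = ref                      # IdResult: same lookup; a stricter version
                                      # would assert each object is defined once

    @property
    def bound(self):                  # one past the highest ID handed out
        return self.next_id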
Generic structure of the file
You're likely best off building the file in four segments:

- The header (top).
- Debug info (debug).
- Types and variables (head).
- The functions (body).
Filling the header
The header should be the most straightforward to fill.
Here's all the information you need for filling it:
capabilities = [ "Shader" ]
externals = [ glsl_std_450 ]
addressing_model = "Logical"
memory_model = "GLSL450"

compute_entry = object();
    exec_model = "GLCompute" # Vertex, TessellationControl, TessellationEvaluation,
                             # Geometry, Fragment, GLCompute, Kernel
    func = main_func
    interface = [global_invocation_id] # The set of Input/Output variables this
                                       # program entry may use.
    execution_modes = [
        Tag("LocalSize", [1, 1, 1])
    ]

entries = [ compute_entry ]
And here's how you fill them in:
# All OpCapability instructions
for cap in capabilities
    top.append(Tag("OpCapability", [cap]))

# Optional OpExtension instructions (extensions to SPIR-V)

# Optional OpExtInstImport instructions.
for ext in externals
    top.append(Tag("OpExtInstImport", [table.result(ext), ext.name]))

# The single required OpMemoryModel instruction.
# Addressing models:
#   Logical
#   Physical32, address width is 32 bits, requires 'Addresses' capability
#   Physical64, address width is 64 bits, requires 'Addresses' capability
# Memory models:
#   Simple, no shared memory consistency issues, requires 'Shader' capability
#   GLSL450, needed by GLSL and ESSL, requires 'Shader' capability
#   OpenCL, requires 'Kernel' capability
top.append(Tag("OpMemoryModel", [addressing_model, memory_model]))

# All entry point declarations, using OpEntryPoint
for entry in entries
    interface = []
    for variable in entry.interface
        interface.append(table.ref(variable))
    top.append(Tag("OpEntryPoint", [
        entry.exec_model,
        table.ref(entry.func),
        entry.func.name,
        interface]))

# All execution mode declarations, using OpExecutionMode
for entry in entries
    for mode in entry.execution_modes
        top.append(Tag("OpExecutionMode", [table.ref(entry.func), mode]))
That's the trivial part.
Debug info
You would likely fill in the debug info in an upcoming step, but you can start without it. Here's some idea of what's involved in providing that info.
Tag("OpSource", ["Unknown", 0, null, null])

# If the language has extensions that need to be known in debugging,
# add them like this.
Tag("OpSourceExtension", ["GL_ARB_separate_shader_objects"])
Tag("OpSourceExtension", ["GL_ARB_shading_language_420pack"])

# Use OpName, OpMemberName to name your variables and struct members.
Tag("OpName", [table.ref(obj), name])
Generating the payload
As the last part, you write in the function bodies. As you do so, write in the type information where needed.
functions = [ main_func ]

builder = object();
    # Debug instructions, in order:
    #  1. OpString, OpSourceExtension, OpSource, OpSourceContinued (no forward references)
    #  2. OpName, OpMemberName
    debug = []
    # decor section not a real requirement? but the validator seems to gunk without
    # pulling the OpDecorate -instructions up.
    decor = []
    # All type declarations (OpTypeXXX), all constant instructions, all global variable
    # declarations (all OpVariable instructions whose Storage Class is not Function).
    # Preferred location for OpUndef instructions.
    # All operands must be declared before use, otherwise they can be in any order.
    # The first section to allow use of OpLine debug information.
    head = []
    # All function declarations (no forward declaration to a function with a body
    # TODO: what does this mean?)
    # All function definitions
    body = []
    # All items we visited, and the ID table (can be merged)
    visited = set()
    table = table

for function in functions
    function.build(builder)

# This is done here to ensure that the references in the interface are defined.
for entry in entries
    for variable in entry.interface
        variable.visit(builder)

unit = object()
unit.generator = 0 # TODO: register the IR generator?
unit.bound = table.bound # The IDs start from '1'
unit.instructions = top ++ builder.debug ++ builder.decor ++ builder.head ++ builder.body
shader_data = spirv.write_buffer(unit)
And like in the famous owl-drawing tutorial, now it's done: you can write the results into a buffer and store them for later use.
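For concreteness, here's roughly what a write_buffer has to do per instruction. This is a Python sketch of my own (the helper names are hypothetical), using the facts that literal strings are nul-terminated UTF-8 padded to a 32-bit word boundary and that the first word packs (word count << 16) | opcode:

# Encode one SPIR-V instruction into a list of 32-bit words.
import struct

def encode_string(s):
    raw = s.encode("utf-8") + b"\x00"
    raw += b"\x00" * (-len(raw) % 4)   # pad to a word boundary
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))

def encode_instruction(opcode, operands):
    words = []
    for operand in operands:
        if isinstance(operand, str):
            words.extend(encode_string(operand))
        else:
            words.append(operand)      # IDs and literal numbers: one word each
    return [((len(words) + 1) << 16) | opcode] + words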
Data structures
Where do all those details come from? The answer is that they come from main_func. The details are quite convoluted, so let's start from the beginning.
First, some data types are defined. They start entirely from the basics:
void_type = OpTypeVoid()
int_type = OpTypeInt(32, true)
uint_type = OpTypeInt(32, false)
vec3_uint_type = OpTypeVector(uint_type, 3)
ptr_uniform_uint = OpTypePointer("Uniform", uint_type)
ptr_input_uint = OpTypePointer("Input", uint_type)
ptr_input_vec3_uint = OpTypePointer("Input", vec3_uint_type)
When each of these structures is visited, an instruction describing the type is inserted into the .head section of the builder.
# The size of the RuntimeArray is not known before runtime.
runtime_array_type = OpTypeRuntimeArray(uint_type)
runtime_array_type.decorate = [
    Tag("ArrayStride", [4])
]

# The buffer block in the program
struct_type = OpTypeStruct([
    Member("result", runtime_array_type);
        decorate = [ Tag("Offset", [0]) ] # Offset of this member.
])
struct_type.decorate = [
    "BufferBlock" # Tells that it's a shader interface block
]
uniform_type = OpTypePointer("Uniform", struct_type)
Above, the result buffer is described. It's a bit interesting that all of this is represented as pointers. It's also a bit interesting that the storage buffer is defined as a uniform. But that's true according to the Vulkan spec (13.1.8. Storage Buffer).
The structures are annotated with information about their other properties. It's all easy to understand, but you have to be careful to ensure that everything is visited before use. Also, any SPIR-V ID may carry decorations, so you have to make sure your model accounts for that.
Then there are the variable definitions. Yes, they won't be present if they're not used; that shouldn't be trouble though.
uniform_var = OpVariable(uniform_type, "Uniform")
uniform_var.name = "uniform_var"
uniform_var.decorate = [
    Tag("DescriptorSet", [0]),
    Tag("Binding", [0])
]

global_invocation_id = OpVariable(ptr_input_vec3_uint, "Input")
global_invocation_id.name = "global_invocation_id"
global_invocation_id.decorate = [
    Tag("BuiltIn", ["GlobalInvocationId"])
]
All the storage qualifiers and pointers have to go right. The "Input" storage variables have to be listed in the Entry points.
The "BuiltIn" variables are defined in the SPIR-V and Vulkan specs. GlobalInvocationId tells the ID number of an individual worker within the whole dispatch. The ID consists of three integers, to allow structuring the parallel computation along three linear dimensions, as sketched below.
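Here's a sketch (Python; the names are mine) of how that three-integer ID is derived: gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.

# Enumerate the global invocation IDs a dispatch produces.
from itertools import product

def global_invocation_ids(groups, local_size=(1, 1, 1)):
    for wg in product(*(range(g) for g in groups)):
        for local in product(*(range(l) for l in local_size)):
            yield tuple(w * s + l for w, s, l in zip(wg, local_size, local))

# With LocalSize 1 1 1 and dispatch(512 * 512, 1, 1), the x component
# runs over 0 .. 512*512 - 1:
#   global_invocation_ids((512 * 512, 1, 1))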
Then we have a few constants. Each integer type requires its own. Don't ask me why, I don't know. You would preferably reuse these when you can.
uint_0 = OpConstant([0], uint_type)
int_0 = OpConstant([0], int_type)
The SPIR-V format consists of a sequence of 32-bit words. Because of that, a list of integers represents each constant here. A more mature implementation would likely just hold some byte arrays instead.
Finally there's the function and the SSA blocks for it.
func_type = OpTypeFunction([], void_type)
main_func = OpFunction(func_type)
main_func.name = "main"
entry = main_func.new_block()
ins_0 = entry.op("OpAccessChain", [global_invocation_id, [uint_0]], ptr_input_uint)
ins_1 = entry.op("OpLoad", [ins_0, null], uint_type)
ins_2 = entry.op("OpAccessChain", [global_invocation_id, [uint_0]], ptr_input_uint)
ins_3 = entry.op("OpLoad", [ins_2, null], uint_type)
ins_4 = entry.op("OpAccessChain", [uniform_var, [int_0, ins_1]], ptr_uniform_uint)
entry.op("OpStore", [ins_4, ins_3, null])
entry.op("OpReturn", [])
This is the meat of this module. Note that every operation with a result is required to have a type. The null fields in OpLoad/OpStore are for memory access flags; they can describe whether the memory access is volatile, aligned or nontemporal.
OpAccessChain retrieves a pointer into a composite object. It is probably there to distinguish shader variable access from a real load operation.
Out of necessity, this system needs quite a lot of datatype classes. I'm not entirely satisfied with how it looks, but I'll be sure to do something about that very soon.
It is possible that I won't be using separate tools to describe these datatypes. I'll probably extend my CFFI to deal with the special cases required by SPIR-V.
I'll show some of them anyway, because they help in understanding the system. Here's the void type class:
class OpTypeVoid extends OpType
    visit = (self, builder):
        return self if self in builder.visited
        builder.visited.add(self)
        add_decorate(builder, self)
        builder.head.append(Tag("OpTypeVoid", [
            builder.table.result(self)
        ]))
        return self
Most of the type classes are very much alike, so I won't show more type classes, but here's a constant class:
class OpConstant extends Op
    +init = (self, value, type):
        self.value = value
        self.type = type

    visit = (self, builder):
        return self if self in builder.visited
        builder.visited.add(self)
        add_decorate(builder, self)
        builder.head.append(Tag("OpConstant", [
            builder.table.as_type(self.type.visit(builder)),
            builder.table.result(self),
            self.value
        ]))
        return self
The constant and variable classes are similar. It's pretty much a visit guard + code to write the instruction for the object.
The OpFunction and pals should be an interesting bunch:
class OpFunction extends Op
    +init = (self, type, blocks=[]):
        self.type = type
        self.blocks = blocks
        self.control = [] # https://www.khronos.org/registry/spir-v/specs/1.0/SPIRV.html#Function_Control
                          # None, Inline, DontInline, Pure, Const

    build = (self, builder):
        type = self.type.visit(builder)
        builder.body.append(
            Tag("OpFunction", [
                builder.table.as_type(type.restype),
                builder.table.result(self),
                self.control,
                builder.table.ref(type) ]))
        for block in self.blocks
            builder.body.append(
                Tag("OpLabel", [builder.table.result(block)]))
            for ins in block.body
                ins.build(builder)
        builder.body.append(
            Tag("OpFunctionEnd", []))

    new_block = (self):
        block = OpBlock()
        self.blocks.append(block)
        return block

class OpBlock
    +init = (self, body=[]):
        self.body = body

    op = (self, name, args, restype=null):
        op = Operation(name, args, restype)
        self.body.append(op)
        return op

class Operation extends Op
    +init = (self, name, args, restype=null):
        self.name = name
        self.args = args
        self.restype = restype

    visit = (self, builder):
        assert self.restype, "has no result"
        return self

    build = (self, builder):
        args = []
        if self.restype
            restype = self.restype.visit(builder)
            args.append(builder.table.as_type(restype))
            args.append(builder.table.result(self))
        for arg in self.args
            args.append(rename(builder, arg))
        builder.body.append(
            Tag(self.name, args))
The functions and operations are not 'visited'; they are built directly into the program. I would probably write it all into a single function, but I still have to explore what kind of structure is best for generating function forward declarations.
A peek into the future
There is obviously a lot of effort in everything you see above compared to just using GLSL & OpenGL. But it is really worthwhile if you eventually want to do the following:
import ffi, spirv_target

gl = spirv_target.Env()

result = spirv_target.StorageBuffer(
    ffi.array(ffi.int),
    {set = 0, binding = 0, name = "result"})

compute_main = ():
    i = gl.invocation_id.x
    result[i] = i

program = spirv_target.compute(compute_main)
To achieve this I have to write a partial evaluator for my language. It's halfway there but I have some critical details still missing.
There is added difficulty because I want to reuse the same partial evaluator for CPU and webassembly.
The ramifications of this tool can be dramatic. If it is easy to write GPU code, you will obviously write more of it. Your programs become faster simply because you have easier access to all the computing resources on your computer.
It doesn't end there though, because the translator will obtain a lot of information about what we intended to do in the first place. That can be used to pre-fill some of the records required to establish the shader state, so that we won't need to do it later. We no longer need to define layouts twice, so the code will begin to look like this:
gpu = GPU()
shader = program.load(gpu)

dsc0 = shader.DSC0()
dsc0.result = {
    buffer = output_data.buffer
    offset = 0
    range = -1
}
gpu.update(dsc0)

pipeline_layout = shader.pipeline_layout
Moreover, the possibility to easily move code between your GPU code, your optimized code and your dynamic codebase allows flexibility that has never been there before. The program design can be explored much further than ever before.
It'll be a perfect platform for programming many kinds of demanding applications.