GPU computing through SPIR-V
SPIR-V is an intermediate language for GPU shaders and kernels. It is meant to be used with Vulkan and OpenCL.
I'm preparing to use SPIR-V as a compile target, so I have to understand the details of this format fairly well before I proceed.
I used the "Simple Vulkan Compute Example" as a starting point, though due to the information missing from that post, it might have been easier to just read the Vulkan spec instead.
OUTLINE of this post
This post discusses...
- A study of the SPIR-V produced from a simple GLSL compute shader.
- How to disassemble SPIR-V.
- Details about the structure of a SPIR-V file.
- How to run the mentioned compute program with Vulkan.
- An approach to building up SPIR-V binaries from your own IR.
Disclaimer: These are my notes and should not be relied upon.
The SPIR-V program
GPU computing has become much simpler over time, yet it is still quite complicated. For this reason our first program is going to be really simple. Here's the GLSL source code for it:
/* Mandatory for Vulkan GLSL */
#version 450
#extension GL_ARB_separate_shader_objects : enable
#extension GL_ARB_shading_language_420pack : enable

/* Describes a buffer, as an array of unsigned integers */
layout(binding = 0) buffer InOut {
    uint result[];
};

/* Let's store the invocation ID into each index */
void main() {
    result[gl_GlobalInvocationID.x] = gl_GlobalInvocationID.x;
}
If you dispatch this with thousands of workers with increasing invocation IDs, you get the array filled with [0, 1, 2, ...]. Not very interesting or useful, just something that we can observe. If you discard every fourth byte and interpret the rest as an RGB image, the contents of the result buffer can even be viewed as a picture.
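For instance, here's the byte-discarding trick as a short Python sketch (my own illustration, not part of the pipeline):

# A sketch (plain Python) of the "discard every fourth byte" trick: each
# little-endian uint32 in the result buffer becomes one RGB pixel once its
# high byte is dropped.
def as_rgb(data):
    return bytes(b for i, b in enumerate(data) if i % 4 != 3)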
Back to the shader: the GLSL file glsl_blur.comp can be compiled into SPIR-V with the following Linux shell command:
glslangValidator -V glsl_blur.comp -o glsl_blur.comp.spv
glslangValidator has to be installed before you can use it, so development with GLSL isn't as simple as it used to be. But that's just as well; GLSL is a dumb language to start with.
Some tools for manipulating SPIR-V files are available from SPIRV-Tools. Among these you find spirv-dis. Here's the output of spirv-dis glsl_blur.comp.spv:
; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 1
; Bound: 24
; Schema: 0
OpCapability Shader
%1 = OpExtInstImport "GLSL.std.450"
OpMemoryModel Logical GLSL450
OpEntryPoint GLCompute %main "main" %gl_GlobalInvocationID
OpExecutionMode %main LocalSize 1 1 1
OpSource GLSL 450
OpSourceExtension "GL_ARB_separate_shader_objects"
OpSourceExtension "GL_ARB_shading_language_420pack"
OpName %main "main"
OpName %InOut "InOut"
OpMemberName %InOut 0 "result"
OpName %_ ""
OpName %gl_GlobalInvocationID "gl_GlobalInvocationID"
OpDecorate %_runtimearr_uint ArrayStride 4
OpMemberDecorate %InOut 0 Offset 0
OpDecorate %InOut BufferBlock
OpDecorate %_ DescriptorSet 0
OpDecorate %_ Binding 0
OpDecorate %gl_GlobalInvocationID BuiltIn GlobalInvocationId
%void = OpTypeVoid
%3 = OpTypeFunction %void
%uint = OpTypeInt 32 0
%_runtimearr_uint = OpTypeRuntimeArray %uint
%InOut = OpTypeStruct %_runtimearr_uint
%_ptr_Uniform_InOut = OpTypePointer Uniform %InOut
%_ = OpVariable %_ptr_Uniform_InOut Uniform
%int = OpTypeInt 32 1
%int_0 = OpConstant %int 0
%v3uint = OpTypeVector %uint 3
%_ptr_Input_v3uint = OpTypePointer Input %v3uint
%gl_GlobalInvocationID = OpVariable %_ptr_Input_v3uint Input
%uint_0 = OpConstant %uint 0
%_ptr_Input_uint = OpTypePointer Input %uint
%_ptr_Uniform_uint = OpTypePointer Uniform %uint
%main = OpFunction %void None %3
%5 = OpLabel
%18 = OpAccessChain %_ptr_Input_uint %gl_GlobalInvocationID %uint_0
%19 = OpLoad %uint %18
%20 = OpAccessChain %_ptr_Input_uint %gl_GlobalInvocationID %uint_0
%21 = OpLoad %uint %20
%23 = OpAccessChain %_ptr_Uniform_uint %_ %int_0 %19
OpStore %23 %21
OpReturn
OpFunctionEnd
This is the information stored in the SPIR-V file. It is divided into the following sections that have to come in the correct order:
- Capabilities
- Extensions/Imports
- Memory model
- Entry points
- Debugging info
- Type/Variable declarations
- Function declarations
Most of the information above isn't very interesting. The OpSource, OpSourceExtension and Op*Name instructions are only for debugging and could be discarded for compaction.
The most interesting lines for now can be found at the top:
OpMemoryModel Logical GLSL450
OpEntryPoint GLCompute %main "main" %gl_GlobalInvocationID
OpExecutionMode %main LocalSize 1 1 1
There can be only one OpMemoryModel, which describes the addressing and memory model for the whole module. On the other hand, there can be multiple entry points, and many execution mode flags are allowed as well.
The IDs after the visible name in the entry point are input/output variables. Every Input/Output variable that an entry point may use must be identified there.
LocalSize 1 1 1 tells the size of the work groups used by the entry point. I'm not certain of every detail because I haven't used compute modules a lot, but it sounds simple: with a local size of 1x1x1, each work group contains a single invocation, so the dispatch(512 * 512, 1, 1) call we issue later produces 512 * 512 invocations in total.
The actual program could be also written like this:
entry:
    i18 = access_chain gl_GlobalInvocationID.x
    i19 = load i18
    i20 = access_chain gl_GlobalInvocationID.x
    i21 = load i20
    i23 = access_chain InOut.result[i19]
    store i23 i21
    ret

i18 : uint* (Input)
i19 : uint
i20 : uint* (Input)
i21 : uint
i23 : uint* (Uniform)
The above is a sketch of an intermediate language I intend to use for representing most of this. It is still open how this will pan out eventually.
The Vulkan side (in Lever)
There are hoops and loops that we can hardly skip for now. Ideally, using the GPU shouldn't be much harder than using any other computing resource on the computer.
Fortunately it's not too bad. Let's walk through the Lever code that abstracts the task.
The first thing is to obtain access to the GPU. In Lever there's a module for this:
gpu = GPU()
device_only = GPUMemory(gpu, device_flags)
readback = GPUMemory(gpu, readback_flags)
upload = GPUMemory(gpu, upload_flags)
The latter three are memory heap managers. We only use readback here. The distinction between these three heaps doesn't come directly from the Vulkan API, but it's a neat arrangement anyway.
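The flag definitions aren't shown here, but such a three-way split conventionally maps onto Vulkan memory property flags roughly like this (a sketch of an assumption, not Lever's actual definitions):

# An assumption, not Lever's actual definitions: a conventional mapping of
# the three heaps onto Vulkan memory property flags.
device_flags   = ["DEVICE_LOCAL_BIT"]                       # fast, GPU-only memory
readback_flags = ["HOST_VISIBLE_BIT", "HOST_CACHED_BIT"]    # GPU writes, CPU reads back
upload_flags   = ["HOST_VISIBLE_BIT", "HOST_COHERENT_BIT"]  # CPU writes, GPU reads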
Loading the shader program
Loading the shader into the GPU is trivial. Note that if your shader binary is invalid, this will crash the process.
shader_data = spirv.write_buffer(unit)
module = vulkan.ShaderModule(gpu, {
    codeSize = shader_data.length
    pCode = shader_data
})
custom_shader_stage = {
    stage = "COMPUTE_BIT"
    module = module
    name = "main"
}
Data & Descriptor sets
Without any data we can do nothing. Fortunately it's easy to set up a buffer:
size = 512 * 512 * 4
buffer = readback.createBuffer(size, "STORAGE_BUFFER_BIT")
data = buffer.mem.map(ffi.byte)
The buffer needs to inform the system of how it will be used. Note that it's already memory-mapped for reading; that's just for convenience, as there's no reason to map it yet.
The descriptor set layout and pipeline layout describe what kind of layout our shader program expects. This could be derived directly or partially from the shader program, so it is a little bit redundant.
DSC0 = DescriptorSetLayout(gpu, {
    "output_data": {
        binding = 0
        descriptorType = "STORAGE_BUFFER"
        stageFlags = "COMPUTE_BIT"
    }
})

pipeline_layout = vulkan.PipelineLayout(gpu, {
    flags = 0
    setLayouts = [DSC0]
    pushConstantRanges = []})
A lot of things in Vulkan are redundant data that just needs to be there, yet it may still be hard to determine where that data should come from.
Finally a descriptor set is created:
dsc0 = DSC0()
dsc0.output_data = {
    buffer = output_data.buffer
    offset = 0
    range = -1
}
gpu.update(dsc0)
This is used to attach the memory and buffers into the module as it runs.
Pipeline
Here's how a compute pipeline is created:
pipeline = gpu.createComputePipelines(gpu.pipeline_cache, [{
    stage = custom_shader_stage
    layout = pipeline_layout
}])[0]
The compute pipeline is among the simplest pipelines in Vulkan. Now everything's ready and we can construct a command buffer to be submitted to the GPU.
Command buffer
The command buffer is a log of commands to be evaluated. First they are only recorded; later the submit command sends them over to the GPU for evaluation.
Note how the command buffer collects everything that we created and bundles them together.
cbuf = gpu.queue.pool.allocate({
    level = "PRIMARY",
    commandBufferCount = 1})[0]

cbuf.begin({flags = "ONE_TIME_SUBMIT_BIT"})
cbuf.bindPipeline("COMPUTE", pipeline)
cbuf.bindDescriptorSets("COMPUTE", pipeline_layout, 0,
    [dsc0], [])
cbuf.dispatch(512 * 512, 1, 1)
cbuf.end()
That's all. You don't need to worry about implicit state being tucked anywhere because there's none in Vulkan.
Submission & Waiting
Finally, for the commands to take effect, they need to be submitted. This buffer may be submitted only once, because we said so with ONE_TIME_SUBMIT_BIT in the earlier step.
fence = vulkan.Fence(gpu, {})
gpu.queue.submit([{
    waitSemaphores = []
    waitDstStageMask = []
    commandBuffers = [cbuf]
    signalSemaphores = []
}], fence)
A fence is passed to the above command. To make sure we don't read the results before they've been written, we have to wait for the fence before accessing the data.
status = gpu.waitForFences([fence], true, -1)
assert status.SUCCESS, status
write_rgb_image_file("output.png", data)
Cleaning up
Finally, lots of GPU resources are such that you have to release them explicitly. I've made this simple in Lever; all you need to call is gpu.destroy:
gpu.destroy()
In general, this should be sufficient. There may be special cases where additional steps are needed.
Building up SPIR-V binaries
There is a lot involved in building a backend that emits SPIR-V binaries. If you attempt it, I propose you do it in layers.
The very first step you have to take is to become capable of decoding and encoding SPIR-V binaries. This is easy, because the SPIR-V registry provides a machine-readable specification of the format. Using that file you should be able to get it all open.
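For instance, here's a sketch of loading that machine-readable grammar to look up operand roles. I'm assuming the JSON file distributed in the registry/SPIRV-Headers as spirv.core.grammar.json, whose entries carry "opname", "opcode" and "operands" with a "kind" field; treat the specifics as assumptions if your copy differs:

# A sketch (Python) of indexing the machine-readable SPIR-V grammar.
import json

with open("spirv.core.grammar.json") as f:
    grammar = json.load(f)

by_opcode = {ins["opcode"]: ins for ins in grammar["instructions"]}
by_name   = {ins["opname"]: ins for ins in grammar["instructions"]}

# Operand roles for OpLoad: ['IdResultType', 'IdResult', 'IdRef', ...]
print([operand["kind"] for operand in by_name["OpLoad"]["operands"]])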
Note that you don't need to decode every instruction, because every instruction has its length encoded in it. So the spec may be used in several ways. In any case, it helps you get the decoding/encoding right without guesswork or manual labor! Very useful.
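In fact you can split a module into instructions without knowing a single opcode. A minimal sketch in Python (my own throwaway code, little-endian assumed): the high 16 bits of each instruction's first word give its total length in words.

# Split a SPIR-V binary into (opcode, operand words) pairs.
import struct

def instructions(path):
    with open(path, "rb") as f:
        data = f.read()
    words = struct.unpack("<%dI" % (len(data) // 4), data)
    # A magic of 0x03022307 would mean the words need byte-swapping.
    assert words[0] == 0x07230203, "not a little-endian SPIR-V file"
    # Header words: magic, version, generator, bound, schema.
    i = 5
    while i < len(words):
        wordcount = words[i] >> 16
        opcode = words[i] & 0xffff
        assert wordcount > 0, "corrupt instruction"
        yield opcode, words[i+1 : i+wordcount]
        i += wordcount

for opcode, operands in instructions("glsl_blur.comp.spv"):
    print(opcode, operands)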
ID table
Every ID in a SPIR-V file is a reference to an object. When generating the file you have to keep track of which object corresponds to which ID, and you have to assign new IDs as you go.
For this purpose it's a good idea to have an ID table. You retrieve IDs from this table, and when your file is finished, you take one past the highest assigned ID and store it in the bound field of the SPIR-V header.
Also, if you prefer, add some checking that you get the roles IdResult, IdResultType and IdRef correct:

- An instruction containing an IdResult is thought to create an object.
- An IdResultType is the type ID for the result created in that instruction.
- An IdRef is an ordinary reference.
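As a sketch, an ID table can be as small as this (Python; the class and method names are mine, mirroring the description above, not an existing API):

# A minimal ID table for SPIR-V generation.
class IdTable:
    def __init__(self):
        self.ids = {}
        self.next_id = 1              # SPIR-V IDs start from 1

    def ref(self, obj):               # IdRef: look up an ID, assigning on demand
        if obj not in self.ids:
            self.ids[obj] = self.next_id
            self.next_id += 1
        return self.ids[obj]

    result = ref                      # IdResult: same lookup; a stricter version
                                      # would assert each object is defined once

    @property
    def bound(self):                  # one past the highest ID handed out
        return self.next_id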
Generic structure of the file
You're likely best off building the file in four segments:

- The header (top).
- Debug info (debug).
- Types and variables (head).
- The functions (body).
Filling the header
The header should be the most straightforward to fill.
Here's all the information you need for filling it:
capabilities = [ "Shader" ]
externals = [ glsl_std_450 ]
addressing_model = "Logical"
memory_model = "GLSL450"

compute_entry = object();
    exec_model = "GLCompute" # Vertex, TessellationControl, TessellationEvaluation,
                             # Geometry, Fragment, GLCompute, Kernel
    func = main_func
    interface = [global_invocation_id] # The set of Input/Output variables this
                                       # program entry may use.
    execution_modes = [
        Tag("LocalSize", [1, 1, 1])
    ]

entries = [ compute_entry ]
And here's how you fill them in:
# All OpCapability instructions
for cap in capabilities
    top.append(Tag("OpCapability", [cap]))

# Optional OpExtension instructions (extensions to SPIR-V)

# Optional OpExtInstImport instructions.
for ext in externals
    top.append(Tag("OpExtInstImport", [table.result(ext), ext.name]))

# The single required OpMemoryModel instruction.
# Addressing models:
#   Logical
#   Physical32, address width is 32 bits, requires 'Addresses' capability
#   Physical64, address width is 64 bits, requires 'Addresses' capability
# Memory models:
#   Simple, no shared memory consistency issues, requires 'Shader' capability
#   GLSL450, needed by GLSL and ESSL, requires 'Shader' capability
#   OpenCL, requires 'Kernel' capability
top.append(Tag("OpMemoryModel", [addressing_model, memory_model]))

# All entry point declarations, using OpEntryPoint
for entry in entries
    interface = []
    for variable in entry.interface
        interface.append(table.ref(variable))
    top.append(Tag("OpEntryPoint", [
        entry.exec_model,
        table.ref(entry.func),
        entry.func.name,
        interface]))

# All execution mode declarations, using OpExecutionMode
for entry in entries
    for mode in entry.execution_modes
        top.append(Tag("OpExecutionMode", [table.ref(entry.func), mode]))
That's the trivial part.
Debug info
You would likely fill in the debug info in an upcoming step, but you can start without it. Here's some idea of what's involved in providing that info.
Tag("OpSource", ["Unknown", 0, null, null])

# If the language has extensions that need to be known in debugging,
# add them like this.
Tag("OpSourceExtension", ["GL_ARB_separate_shader_objects"])
Tag("OpSourceExtension", ["GL_ARB_shading_language_420pack"])

# Use OpName, OpMemberName to name your variables and struct members.
Tag("OpName", [table.ref(obj), name])
Generating the payload
As the last part, you write in the function bodies. As you do so, write in the type information where needed.
functions = [ main_func ]

builder = object();
    # Debug instructions, in order:
    #  1. OpString, OpSourceExtension, OpSource, OpSourceContinued (no forward references)
    #  2. OpName, OpMemberName
    debug = []
    # decor section not a real requirement? but the validator seems to gunk without
    # pulling the OpDecorate -instructions up.
    decor = []
    # All type declarations (OpTypeXXX), all constant instructions, all global variable
    # declarations (all OpVariable instructions whose Storage Class is not Function).
    # Preferred location for OpUndef instructions.
    # All operands must be declared before use, otherwise they can be in any order.
    # The first section to allow use of OpLine debug information.
    head = []
    # All function declarations (no forward declaration to a function with a body
    # TODO: what does this mean?)
    # All function definitions
    body = []
    # All items we visited, and the ID table (can be merged)
    visited = set()
    table = table

for function in functions
    function.build(builder)

# This is done here to ensure that the references in the interface are defined.
for entry in entries
    for variable in entry.interface
        variable.visit(builder)

unit = object()
unit.generator = 0 # TODO: register the IR generator?
unit.bound = table.bound # The IDs start from '1'
unit.instructions = top ++ builder.debug ++ builder.decor ++ builder.head ++ builder.body
shader_data = spirv.write_buffer(unit)
And like in the famous owl-drawing tutorial, now it's done: you can write the results into a buffer and store them for later use.
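For concreteness, here's roughly what a write_buffer has to do per instruction. This is a Python sketch of my own (the helper names are hypothetical), using the facts that literal strings are nul-terminated UTF-8 padded to a 32-bit word boundary and that the first word packs (word count << 16) | opcode:

# Encode one SPIR-V instruction into a list of 32-bit words.
import struct

def encode_string(s):
    raw = s.encode("utf-8") + b"\x00"
    raw += b"\x00" * (-len(raw) % 4)   # pad to a word boundary
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))

def encode_instruction(opcode, operands):
    words = []
    for operand in operands:
        if isinstance(operand, str):
            words.extend(encode_string(operand))
        else:
            words.append(operand)      # IDs and literal numbers: one word each
    return [((len(words) + 1) << 16) | opcode] + words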
Data structures
Where do all those details come from? The answer is that they come from main_func. The details are quite convoluted, so let's start from the beginning.
First, some data types are defined. They start entirely from the basics:
void_type = OpTypeVoid()
int_type = OpTypeInt(32, true)
uint_type = OpTypeInt(32, false)
vec3_uint_type = OpTypeVector(uint_type, 3)
ptr_uniform_uint = OpTypePointer("Uniform", uint_type)
ptr_input_uint = OpTypePointer("Input", uint_type)
ptr_input_vec3_uint = OpTypePointer("Input", vec3_uint_type)
When each of these structures is visited, an instruction describing the type is inserted into the .head section of the builder.
# The size of the RuntimeArray is not known before runtime.
runtime_array_type = OpTypeRuntimeArray(uint_type)
runtime_array_type.decorate = [
    Tag("ArrayStride", [4])
]

# The buffer block in the program
struct_type = OpTypeStruct([
    Member("result", runtime_array_type);
        decorate = [ Tag("Offset", [0]) ] # Offset of this member.
])
struct_type.decorate = [
    "BufferBlock" # Tells that it's a shader interface block
]
uniform_type = OpTypePointer("Uniform", struct_type)
Above, the result buffer is described. It's a bit interesting that all of this is represented as pointers. It's also a bit interesting that the storage buffer is defined as a uniform. But that's true according to the Vulkan spec (13.1.8. Storage Buffer).
The structures are annotated with information about their other properties. It's all easy to understand, but you have to be careful to ensure that everything is visited before use. Also, any SPIR-V ID may carry decorations, so you have to make sure your model accounts for that.
Then there are the variable definitions. Yes, they won't be present if they're not used; that shouldn't be trouble though.
uniform_var = OpVariable(uniform_type, "Uniform")
uniform_var.name = "uniform_var"
uniform_var.decorate = [
    Tag("DescriptorSet", [0]),
    Tag("Binding", [0])
]

global_invocation_id = OpVariable(ptr_input_vec3_uint, "Input")
global_invocation_id.name = "global_invocation_id"
global_invocation_id.decorate = [
    Tag("BuiltIn", ["GlobalInvocationId"])
]
All the storage qualifiers and pointers have to go right. The "Input" storage variables have to be listed in the Entry points.
The "BuiltIn" variables are defined in the SPIR-V and Vulkan specs. GlobalInvocationId tells the ID number of an individual worker within the whole dispatch. The ID consists of three integers, to allow structuring the parallel computation along three linear dimensions, as sketched below.
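Here's a sketch (Python; the names are mine) of how that three-integer ID is derived: gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.

# Enumerate the global invocation IDs a dispatch produces.
from itertools import product

def global_invocation_ids(groups, local_size=(1, 1, 1)):
    for wg in product(*(range(g) for g in groups)):
        for local in product(*(range(l) for l in local_size)):
            yield tuple(w * s + l for w, s, l in zip(wg, local_size, local))

# With LocalSize 1 1 1 and dispatch(512 * 512, 1, 1), the x component
# runs over 0 .. 512*512 - 1:
#   global_invocation_ids((512 * 512, 1, 1))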
Then we have a few constants. Each integer type requires its own. Don't ask me why, I don't know. You would preferably reuse these when you can.
uint_0 = OpConstant([0], uint_type)
int_0 = OpConstant([0], int_type)
The SPIR-V format consists of a sequence of 32-bit words. Because of that, a list of integers represents each constant here. A more mature implementation would likely just hold some byte arrays instead.
Finally there's the function and the SSA blocks for it.
func_type = OpTypeFunction([], void_type)
main_func = OpFunction(func_type)
main_func.name = "main"
entry = main_func.new_block()
ins_0 = entry.op("OpAccessChain", [global_invocation_id, [uint_0]], ptr_input_uint)
ins_1 = entry.op("OpLoad", [ins_0, null], uint_type)
ins_2 = entry.op("OpAccessChain", [global_invocation_id, [uint_0]], ptr_input_uint)
ins_3 = entry.op("OpLoad", [ins_2, null], uint_type)
ins_4 = entry.op("OpAccessChain", [uniform_var, [int_0, ins_1]], ptr_uniform_uint)
entry.op("OpStore", [ins_4, ins_3, null])
entry.op("OpReturn", [])
This is the meat of this module. Note that every operation with a result is required to have a type. The null fields in OpLoad/OpStore are for memory access flags; they can describe whether the memory access is volatile, aligned or nontemporal.
OpAccessChain retrieves a pointer into a composite object. It is probably there to distinguish shader variable access from a real load operation.
Out of necessity, this system needs quite a lot of datatype classes. I'm not entirely satisfied with how it looks, but I'll be sure to do something about that very soon.
It is possible that I won't be using separate tools to describe these datatypes. I'll probably extend my CFFI to deal with the special cases required by SPIR-V.
I'll show some of them anyway, because they help in understanding the system. Here's the void type class:
class OpTypeVoid extends OpType
    visit = (self, builder):
        return self if self in builder.visited
        builder.visited.add(self)
        add_decorate(builder, self)
        builder.head.append(Tag("OpTypeVoid", [
            builder.table.result(self)
        ]))
        return self
Most of the type classes are very much alike, so I won't show more type classes, but here's a constant class:
class OpConstant extends Op
    +init = (self, value, type):
        self.value = value
        self.type = type

    visit = (self, builder):
        return self if self in builder.visited
        builder.visited.add(self)
        add_decorate(builder, self)
        builder.head.append(Tag("OpConstant", [
            builder.table.as_type(self.type.visit(builder)),
            builder.table.result(self),
            self.value
        ]))
        return self
The constant and variable classes are similar. It's pretty much a visit guard + code to write the instruction for the object.
The OpFunction and pals should be an interesting bunch:
class OpFunction extends Op
    +init = (self, type, blocks=[]):
        self.type = type
        self.blocks = blocks
        self.control = [] # https://www.khronos.org/registry/spir-v/specs/1.0/SPIRV.html#Function_Control
                          # None, Inline, DontInline, Pure, Const

    build = (self, builder):
        type = self.type.visit(builder)
        builder.body.append(
            Tag("OpFunction", [
                builder.table.as_type(type.restype),
                builder.table.result(self),
                self.control,
                builder.table.ref(type) ]))
        for block in self.blocks
            builder.body.append(
                Tag("OpLabel", [builder.table.result(block)]))
            for ins in block.body
                ins.build(builder)
        builder.body.append(
            Tag("OpFunctionEnd", []))

    new_block = (self):
        block = OpBlock()
        self.blocks.append(block)
        return block

class OpBlock
    +init = (self, body=[]):
        self.body = body

    op = (self, name, args, restype=null):
        op = Operation(name, args, restype)
        self.body.append(op)
        return op

class Operation extends Op
    +init = (self, name, args, restype=null):
        self.name = name
        self.args = args
        self.restype = restype

    visit = (self, builder):
        assert self.restype, "has no result"
        return self

    build = (self, builder):
        args = []
        if self.restype
            restype = self.restype.visit(builder)
            args.append(builder.table.as_type(restype))
            args.append(builder.table.result(self))
        for arg in self.args
            args.append(rename(builder, arg))
        builder.body.append(
            Tag(self.name, args))
The functions and operations are not 'visited'; they are built directly into the program. I would probably write it all into a single function, but I still have to explore what kind of structure is best for generating function forward declarations.
A peek into the future
There is obviously a lot of effort in everything you see above compared to just using GLSL & OpenGL. But it is really worthwhile if you eventually want to do the following:
import ffi, spirv_target

gl = spirv_target.Env()

result = spirv_target.StorageBuffer(
    ffi.array(ffi.int),
    {set = 0, binding = 0, name = "result"})

compute_main = ():
    i = gl.invocation_id.x
    result[i] = i

program = spirv_target.compute(compute_main)
To achieve this I have to write a partial evaluator for my language. It's halfway there but I have some critical details still missing.
There is added difficulty because I want to reuse the same partial evaluator for CPU and webassembly.
The ramifications of this tool can be dramatic. If it is easy to write GPU code, you will obviously write more of it. Your programs become faster simply because you have easier access to all the computing resources on your computer.
It doesn't end there though, because the translator will obtain a lot of information about what we intended to do in the first place. That can be used to pre-fill some of the records required to establish the shader state, so that we won't need to do it later. We no longer need to define layouts twice, so the code will begin to look like this:
gpu = GPU()
shader = program.load(gpu)

dsc0 = shader.DSC0()
dsc0.result = {
    buffer = output_data.buffer
    offset = 0
    range = -1
}
gpu.update(dsc0)

pipeline_layout = shader.pipeline_layout
Moreover, the possibility to easily move code between your GPU code, your optimized code and your dynamic codebase allows flexibility that has never been there before. The program design can be explored much further than ever before.
It'll be a perfect platform for programming many kinds of demanding applications.