A WebAssembly study
This new format might become a proper way to write portable binaries that run in every computer. That way it'd have massive boost to economy and commerce.
The format is meant to be text-serializable, but many compiler writers will likely want to target this one directly. It's going to be the easiest executable format around to target.
There's a demo in WebAssembly Binary Toolkit you likely want to check out. It produces a log out of the generated file from the text representation of the WebAssembly.
The format is still under heavy development so it's going to change. I did study it a bit. It's going to be a great format overall.
use of LEB128 number encoding
The binary encoding -page describes lot of data types. WebAssembly mostly uses the LEB128 encoding it calls varuintN. The numbering was confusing because the actual encoding doesn't change across different sized integers. The N digits just mean how large value is allowed.
It's a pretty brilliant variable size number encoding. The same encoding code can be used to encode both unsigned and signed numbers.
To encode a number this way, you break the number into 7-bit digits, use the eight bit in byte to mark the end of a number and store them in little-endian.
more = True
byte = x & 127
value >>= 7
if value == 0 and byte & 0x40 == 0:
more = False
elif value == -1 and byte & 0x40 != 0:
more = False
byte |= 0x80
On decoding a LEB128 number, you'll get signed representation by sign-extending the last 7-bit number before you piece them back together.
The sign extension of seventh bit is simple to do:
(value & 63) - (value & 64)
Here's the decoding part in whole for LEB128 as well:
value = 0
index = 0
byte = read_byte()
while byte & 128 > 0:
value |= (byte & 127) << 7*index
index += 1
byte = read_byte()
value |= ((byte & 63) - (byte & 64)) << 7*index
value |= (byte & 127) << 7*index
If I didn't do an error, the above code should work as long as you can represent large enough values.
There are not many strings in WebAssembly but there are some, the strings are UTF-8 -encoded and prefixed with a length using a varuint32 such as described before.
This is a nice way to encode strings. I'm happy they preferred it over separating the length into its own field because this is straightforward to encode and decode.
Can be mostly defined by a schema
I wrote a machine-readable schema file, and generated a JSON out of it using Lever's parsing tools. This schema is going to be outdated very fast, but it illustrates an important point that I was able to do so.
Each WebAssembly section can be mostly specified with following constructs:
- Lists that start with varuint32 denoting how many elements, followed by that many elements. I denote them with a star after the type of an element.
- Groups. I denote these with a keyword 'group' and braces containing sequences that have field name and the type of a field.
- Unions. Consists of a 'tag' describing what data comes after it.
- Constants (only in groups) that describe a field consisting of a constant left there for forwards-compatibility.
The advantage of a machine-readable schema is that it suffices to implement the above constructs. Then you can use the schema to decode your WebAssembly files into regular objects that are easier to handle.
There are two points where WebAssembly resists this kind of handling though. They have been well-identified.
Less resizable_limits please
The resizable_limits structure is a packed tuple that describes memory limits of the program and it appears in places points around WebAssembly spec.
This field is annoying because it's quite different from everything else. It's got a following format:
The 'maximum' -field is only present if the bit 0x1 in flags is set. This is troublesome because you have to handle this structure as a special-case.
If there are more of this kind of structures, then you could provide a machine-readable schema for it.
The section encoding
The WebAssembly file is partitioned into sections, and there's an order where they must appear. The details of section encoding is an another detail that is hard to describe in a schema, but it's important for the format so it is ok.
Each section consist of a section name and payload. This makes it easy to skip the uninteresting sections while reading.
The version=11 of WebAssembly has fairly simple section. It can be represented as follows:
That's it. You may guess the blob is similar to the string. It's prefixed with the size in varuint32 -field.
The WebAssembly version=12 is bit different.
If the section is a 'known' item. It's got a nonzero 'id'. For "user-defined" or unknown sections the 'id' is 0 and the name of the section comes first in the payload.
The opcodes in instructions of WebAssembly are not using the LEB128 encoding. The opcode bytes are interpreted directly and followed by immediates. Some of the immediates are LEB128-encoded and others are not.
The stack-oriented webassembly instructions are easily encoded and decoded but the effects of each instruction into the stack cannot be directly determined.
For example, to obtain the stack-effect of 'call' you have to go through function index table and fetch the function signature of the function it calls.
Most of the instructions themselves seem like something that'd easily appear in your average C program.
It'd be interesting to have documentation of how to do instruction selection on webassemly from C expressions. I suppose you won't escape doing that, but it'll be simpler than usual because you may merely optimize for the size of the binary and nothing else. Also there are no concerns over register allocation here.
The naming of instructions seem to well-represent what they operate on and how. There appears to be four distinct patterns in naming. Lets go through them with some examples.
The first part 'i32.' describes what type the operation is in the first place. The number after 'load' means it loads a 16-bit slot from the memory. '_u' means it treats the value unsigned, that it doesn't do sign-extension for the number.
The 'u' and 's' labels are only used if the signedness matters in the operation. For instance in store-operation whether it's signed or not doesn't matter, so none of those operations use 'u' or 's'.
This covers most of the naming, but there's still one part.
The slash notation is used whenever there is an instruction that consumes different value than it produces. There are handful of instructions that behave like this and mark the value they consume after a slash.
Control flow instructions
The control flow instructions in WebAssembly were most difficult to grasp with just documentation. I had to look how they behave to determine what they do.
The first tool you have for control flow is the 'block', 'loop', 'if' and 'end' -instructions. These come in pairs and form control flow blocks.
Any block can be escaped with 'br' and 'br_if' -instruction.
There's also a way to encode switch-case in WebAssembly. You use 'brtable' to do it. Create N nested blocks and use brtable on the top to jump specifically into one of them and then 'br' back.
The text format
WebAssembly is using a lisp-syntax to cleartext print these files. It's looking good and almost sufficient as long as you're just reading these files.
It reminds me of how I started out with Lever before I learned about the Marpa parser and made the utilities into my language to use the algorithm.
This WebAssembly could use the same treatment..
Finding out how 'br_table' works was underwhelming but otherwise this all seems nice and usable. I actually like this a bit more than SPIR-V!
The developers of WebAssembly told me they're going to release the 0xC version of the format soon. I'll wait until then before I try to generate WebAssembly files.