A WebAssembly study

I looked into WebAssembly this weekend. This is a format that, if successful, will allow other languages to supersede Javascript. The early version will improve C/C++ support in the web.

This new format might become a proper way to write portable binaries that run in every computer. That way it'd have massive boost to economy and commerce.

The format is meant to be text-serializable, but many compiler writers will likely want to target this one directly. It's going to be the easiest executable format around to target.

There's a demo in WebAssembly Binary Toolkit you likely want to check out. It produces a log out of the generated file from the text representation of the WebAssembly.

The format is still under heavy development so it's going to change. I did study it a bit. It's going to be a great format overall.

use of LEB128 number encoding

The binary encoding -page describes lot of data types. WebAssembly mostly uses the LEB128 encoding it calls varuintN. The numbering was confusing because the actual encoding doesn't change across different sized integers. The N digits just mean how large value is allowed.

It's a pretty brilliant variable size number encoding. The same encoding code can be used to encode both unsigned and signed numbers.

To encode a number this way, you break the number into 7-bit digits, use the eight bit in byte to mark the end of a number and store them in little-endian.

more = True
while more:
    byte = x & 127
    value >>= 7
    if value == 0 and byte & 0x40 == 0:
        more = False
    elif value == -1 and byte & 0x40 != 0:
        more = False
    else:
        byte |= 0x80
    write(byte)

On decoding a LEB128 number, you'll get signed representation by sign-extending the last 7-bit number before you piece them back together.

The sign extension of seventh bit is simple to do:

(value & 63) - (value & 64)

Here's the decoding part in whole for LEB128 as well:

value = 0
index = 0
byte = read_byte()
while byte & 128 > 0:
    value |= (byte & 127) << 7*index
    index += 1
    byte = read_byte()
if signed:
    value |= ((byte & 63) - (byte & 64)) << 7*index
else:
    value |= (byte & 127) << 7*index
return value

If I didn't do an error, the above code should work as long as you can represent large enough values.

String encoding

There are not many strings in WebAssembly but there are some, the strings are UTF-8 -encoded and prefixed with a length using a varuint32 such as described before.

This is a nice way to encode strings. I'm happy they preferred it over separating the length into its own field because this is straightforward to encode and decode.

Can be mostly defined by a schema

I wrote a machine-readable schema file, and generated a JSON out of it using Lever's parsing tools. This schema is going to be outdated very fast, but it illustrates an important point that I was able to do so.

Each WebAssembly section can be mostly specified with following constructs:

The advantage of a machine-readable schema is that it suffices to implement the above constructs. Then you can use the schema to decode your WebAssembly files into regular objects that are easier to handle.

There are two points where WebAssembly resists this kind of handling though. They have been well-identified.

Less resizable_limits please

The resizable_limits structure is a packed tuple that describes memory limits of the program and it appears in places points around WebAssembly spec.

This field is annoying because it's quite different from everything else. It's got a following format:

flags varuint32
initial varuint32
maximum varuint32

The 'maximum' -field is only present if the bit 0x1 in flags is set. This is troublesome because you have to handle this structure as a special-case.

If there are more of this kind of structures, then you could provide a machine-readable schema for it.

The section encoding

The WebAssembly file is partitioned into sections, and there's an order where they must appear. The details of section encoding is an another detail that is hard to describe in a schema, but it's important for the format so it is ok.

Each section consist of a section name and payload. This makes it easy to skip the uninteresting sections while reading.

The version=11 of WebAssembly has fairly simple section. It can be represented as follows:

group {
    name    string
    payload blob
}

That's it. You may guess the blob is similar to the string. It's prefixed with the size in varuint32 -field.

The WebAssembly version=12 is bit different.

group {
    id varuint7
    payload blob
}

If the section is a 'known' item. It's got a nonzero 'id'. For "user-defined" or unknown sections the 'id' is 0 and the name of the section comes first in the payload.

Instructions

The opcodes in instructions of WebAssembly are not using the LEB128 encoding. The opcode bytes are interpreted directly and followed by immediates. Some of the immediates are LEB128-encoded and others are not.

The stack-oriented webassembly instructions are easily encoded and decoded but the effects of each instruction into the stack cannot be directly determined.

For example, to obtain the stack-effect of 'call' you have to go through function index table and fetch the function signature of the function it calls.

Most of the instructions themselves seem like something that'd easily appear in your average C program.

It'd be interesting to have documentation of how to do instruction selection on webassemly from C expressions. I suppose you won't escape doing that, but it'll be simpler than usual because you may merely optimize for the size of the binary and nothing else. Also there are no concerns over register allocation here.

The naming of instructions seem to well-represent what they operate on and how. There appears to be four distinct patterns in naming. Lets go through them with some examples.

i32.load16_u

The first part 'i32.' describes what type the operation is in the first place. The number after 'load' means it loads a 16-bit slot from the memory. '_u' means it treats the value unsigned, that it doesn't do sign-extension for the number.

The 'u' and 's' labels are only used if the signedness matters in the operation. For instance in store-operation whether it's signed or not doesn't matter, so none of those operations use 'u' or 's'.

This covers most of the naming, but there's still one part.

f64.promote/f32

The slash notation is used whenever there is an instruction that consumes different value than it produces. There are handful of instructions that behave like this and mark the value they consume after a slash.

Control flow instructions

The control flow instructions in WebAssembly were most difficult to grasp with just documentation. I had to look how they behave to determine what they do.

The first tool you have for control flow is the 'block', 'loop', 'if' and 'end' -instructions. These come in pairs and form control flow blocks.

Any block can be escaped with 'br' and 'br_if' -instruction.

There's also a way to encode switch-case in WebAssembly. You use 'brtable' to do it. Create N nested blocks and use brtable on the top to jump specifically into one of them and then 'br' back.

The text format

WebAssembly is using a lisp-syntax to cleartext print these files. It's looking good and almost sufficient as long as you're just reading these files.

It reminds me of how I started out with Lever before I learned about the Marpa parser and made the utilities into my language to use the algorithm.

This WebAssembly could use the same treatment..

Done

Finding out how 'br_table' works was underwhelming but otherwise this all seems nice and usable. I actually like this a bit more than SPIR-V!

The developers of WebAssembly told me they're going to release the 0xC version of the format soon. I'll wait until then before I try to generate WebAssembly files.

Similar posts