RISC-V can be fun if they don't mess it up

Would you like to code on a fun assembly platform that's not from your past but from your future and has more than two registers? Would you like to not buy a new computer to try it out? I do! In this post I show how to try it out as well.

Lots of people are right when it comes to assembly languages. The complexity of the instruction set would not matter today. At least if you had it documented properly! Intel documentation consists of a 600 page PDF. Without machine-readable files describing the operational semantics of each instruction, you're looking into at least 2 years of work to support Intel architecture in its full glory. Except that projects such as google's EXEgesis have made a machine reader for Intel manuals. Unfortunately the information needed to support the Intel as a compiler target did not exist in the manuals.

Intel lacks the documentation for a compiler writer to support their platform. I guess the average response to this would be "We need no stinking compiler writers! We already have LLVM." I've never figured out what to answer to those guys. Push your pillow to the head, it's not my problem.

Eventually I have ceased all of my attempts to build low level code generators or JIT compilers from scratch. I decided to not mess with x86-64 architecture or instruction set anymore. Intel doesn't deserve free high-intensity support from hobbyists all over the world. They're just as piece of crap as the rest of the computing industry is. Therefore they all deserve the same amount of effort for support.

Fast forward to today, now there's a RISC-V simulator rv8 that works on x86-64. Most of it seems to be work of Michael Clark, who also worked on RISC-V stuff for qemu.

I tried RISC-V in this simulator and read through the ISA specification. It's in lots of ways the easiest platform I've used.

I don't repeat the build instructions for rv8 because they got those in the README.md in their github. Riscv gnu toolchain takes hours to install and rv8 takes few minutes on top of that.

If you're going to spend lot of time with this, you probably want to create directory riscv and put initscript.sh there:

# Invoke this with 'source initscript.sh' to quickly setup build env paths.
export RISCV=/path/to/riscv/toolchain
export RV8=/path/to/rv8
export PATH=$PATH:$RISCV/bin:$RV8/bin

Replace the /path/to -things with places where you got the riscv toolchain and rv8 installed.

"Raw" hello world

Next I'm going to show you a hello world that runs in the rv-jit. This is not the usual "hello", but one that's suited for people wanting to continue writing their own compiler backends.

We aren't using the linker to create the ELF executable file, instead we produce it using the assembly code. ELF is a complex weird file because Unix System Laboratories never invented directories. As long as you don't care about all the ELF details, you should be fine using the ELF file format. Remember to never read the ELF spec or feel a need for that, and you're doing fine!

We're using a bit unusual rules to compile the program. Here's the Makefile for you, you know what to do with it:

# There was a bug in the riscv toolchain preventing use of --oformat binary.
# Use of objcopy is a workaround for it.
mini-hello: mini-hello.elf
    riscv64-unknown-elf-objcopy -O binary $^ $@ 
    chmod +x $@

mini-hello.elf: listing.s
    riscv64-unknown-elf-gcc -nostartfiles $^ -o $@

After make, you can produce the finished program with rv-jit mini-hello. Next I provide the listing.s in multiple parts.

.section .text
.globl _start # ld nags without this.
.org 0
load_address = 0x010000 # can also use: 0x08048000

The weird .org 0 and load_address-hack is required because we link an executable by producing an .elf -file and then copy it out from its outer container.

This is the ELF file header. There's a bit of bytes in the header. I suggest using the readelf utility if you wonder what all of this means. Mostly it identifies what this file is, what machine it is and such.

ehdr:
    .byte 0x7f, 0x45, 0x4c, 0x46, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0
    .half 2                                 #   e_type
    .half 0xf3                              #   e_machine
    .word 1                                 #   e_version
    .quad load_address + _start - ehdr      #   e_entry
    .quad phdr - ehdr                       #   e_phoff
    .quad 0                                 #   e_shoff
    .word 0                                 #   e_flags
    .half ehdrsize                          #   e_ehsize    (64)
    .half phdrsize                          #   e_phentsize (56)
    .half 1                                 #   e_phnum
    .half 0                                 #   e_shentsize (64)
    .half 0                                 #   e_shnum
    .half 0                                 #   e_shstrndx
ehdrsize = . - ehdr

The last line calculates the size of the file header, so we won't need to write "56" anywhere. We're doing that here and there.

The two main parts of the ELF file are program and section headers. You don't need section headers in a loadable file like this. The program headers are read by the program loader in the Linux Kernel, probably in the order they are put in. Headers are fixed size entities with immensely important information on them that we mostly don't care about.

The only one that you need for now is a LOAD header.

phdr:
    .word 1                                 #   p_type
    .word 5                                 #   p_flags
    .quad 0                                 #   p_offset
    .quad load_address                      #   p_vaddr
    .quad load_address                      #   p_paddr
    .quad filesize                          #   p_filesz
    .quad filesize                          #   p_memsz
    .quad 0x1000                            #   p_align
phdrsize = . - phdr

If you wonder what this all means, riscv64-unknown-elf-readelf -l mini-hello tells that to you. Here's the output from that:

Elf file type is EXEC (Executable file)
Entry point 0x10078
There is 1 program header, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000010000 0x0000000000010000
                 0x00000000000000b7 0x00000000000000b7  R E    0x1000

Note the first three bytes in p_flags follow the rwx -file convention. They tell how the memory is protected. Remember that in modern systems you can't have write and execute byte enabled at the same time. This also applies to mmap and probably even to chmod -command.

The physical and file-address refer to the idea that the file image and memory image it represents are two different things. If you want empty space for something, you provide no file offset or size and only describe virtual address offset and size. In practice though, you don't want that as mmap can be called from the assembly code rather easily.

Alignment tells how the things are aligned in files and memory. If you need to pad things, remember to use zero padding and not the sensitive important secret documents the popular word processor used to use for padding in .doc files.

If you're really interested, you may also use the -h -flag to print the file header.

Next there's bit of magic numbers. Assembly programming produces lot of magic numbers and I suggest you name all of them and write down what they mean.

# Linux system call numbers https://rv8.io/syscalls.html
sys_write = 64
sys_exit  = 93
# Standard fileno numbers
stdin  = 0
stdout = 1
stderr = 2

Finally here's the program itself and the much-needed ending.

_start:
    li a0, stdout
    lla a1, greeting_text
    lw a2, -4(a1)
    li a7, sys_write
    ecall
    bne a0, a2, write_failed

    li a0, 0
    li a7, sys_exit
    ecall

write_failed:
    li a0, 1
    li a7, sys_exit
    ecall

    .word greeting_end - greeting_text
greeting_text:
    .string "Hello world\n"
greeting_end:

filesize = . - ehdr

Note I'm being a bit fancy here and catching the write failure.

The riscv-mini-hello is available for you to tinker in github. It's even got a license so you can make a product based on minimally greeting people with a computer.

Short introduction to the ISA

If you've been following very carefully, you may have been noticed that the RISC-V ISA doesn't have instructions that encode to zero. This is very helpful if your program escapes.

Something that's harder to notice is that position-independent code in RISC-V is so easy to write that you're not really wanting to write any other kind of code in it.

The instruction set is large and can grow into any direction, but all the instructions are grouped into categories. Programs flag which instructions they are using. It was either the ehdr or phdr where you mark it, you'll figure it out quickly (Actually didn't find it myself).

RV32I base instruction set provides you with the following:

Load long address / PC relative address
Jump and link (JAL)
Jump and link register (JALR)
Compare two registers and branch (BEQ,BNE,BLT,BGE,BLTU,BGEU)
Load/Save (LB,LH,LW,LBU,LHU,SB,SH,SW)
Add/Move (ADDI)
Set 1/0 if register less than the immediate value (SLTI,SLTIU)
Bunch of arithmetic (XORI,ORI,ANDI,SLLI,SRLI,SRAI,ADD,SUB,SLL,SLT,SLTU,XOR,SRL,SRA,OR,AND)
Fences (FENCE, FENCE.I)
System calls (ECALL)
Break to debugger (EBREAK)
System status register transfers (CSRRW,CSRRS,CSRRC,CSRRWI,CSRRSI,CSRRCI)

There are only 6 different encodings for these. The chicanery with the sign flag in the immediate field has been left at minimum. You can read more from the User-Level ISA spec about these instructions. It's very easy to find.

Also most of these basic things are very easy to understand. There's not much guesswork involved with the semantics of the instructions. Especially since there's no cflags register to confuse hardware designers.

Well for the FENCE you probably have to read the ISA manual. It's actually quite important as you need it for ordering memory write/reads between RISC-V "threads" harts, and for external I/O. This actually seemed very, very important so I gave it more look.

We need a memory model to give everyone a headache and ensure that there doesn't need to be communication between harts when there is none. I'm not quite sure about this, but there seem to be still work being done on this.

It'd seem every hart sees every memory store operation it did previously, but in the global scenery these harts see different order between operations. We should be thinking there's a global memory order that is total ordering of the memory operations produced by all harts, and it's not the same as what an individual hart sees in its little mind. Fences constrain the global memory order in respect to the hart.

The fence rw,rw is your full memory barrier. It means that no memory operations from before switch place with any memory operations after. Basically, the total memory order contribution of this hart must satisfy this rule.

The fence instruction consists of a predecessor and successor set flags. The predecessor refers to the set of operations preceding the fence. For example, the r would refer to all of local hart's operations that read to the memory.

Lets take an example.

sw x0, 0(s0)
fence w,r
lw x10, 0(s1)

With the fence, other harts see that this hart writes zero to s0 before reading value from s1.

In an another, extreme case, we might want to configure an audio chip before we let it play a note. We would do a output-output fence to ensure the device sees config being written before notes are sent.

Fences and memory models are something we all have to take seriously on future concurrently running systems. I hope everybody does.

The RV32I instructions are all 32-bit long, but there's also a way to get 16-bit instructions, 64-bit and 128-bit instructions. They've been forward thinking and the encoding of these do not collide with the existing 32-bit instructions, so they do not corrupt the program if mixed together in mistake.

There are 32 integer registers and the first register is always zero even if you wrote something to it.

The calling conventions for those registers are a bit difficult to remember. The following bitmasks describe them if you ever need to know it:

neither_save_registers = 0x19
caller_save_registers = 0xF003FCE2
callee_save_registers = 0x0FFC0304

  28   24   20   16   12    8    4    0
0000 0000 0000 0000 0000 0000 0001 1001 neither saves
1111 0000 0000 0011 1111 1100 1110 0010 caller saves
0000 1111 1111 1100 0000 0011 0000 0100 callee saves

If you manage to remember the callee-save register bitmask (faulty faulty caca oh three oh four), you're probably doing fine as also the three and four registers are the neither-save registers.

Drawing graphics in RV8-JIT

Before we get away from writing assembly code with an assembler. I'd like to show you few other things. I'm sure you're yearning to draw some graphics on the screen and I think it shouldn't be the hardest thing in the world to do.

The RV8-jit quite doesn't have everything to do it without help, but we can build a small SDL2 program that does most of the work. Lets start with the Makefile again.

SDL2_CFLAGS = $(shell pkg-config sdl2 --cflags)
SDL2_LIBS = $(shell pkg-config sdl2 --libs)

CFLAGS = $(SDL2_CFLAGS)
LIBS = $(SDL2_LIBS)

.PHONY: all
all: viewer maze

viewer: viewer.c
    gcc $(CFLAGS) -O2 $^ -o $@ $(LIBS)

# There was a bug in the riscv toolchain preventing use of --oformat binary.
# Use of objcopy is a workaround for it.
maze: maze.elf
    riscv64-unknown-elf-objcopy -O binary $^ $@ 
    chmod +x $@

maze.elf: listing.s slant_a.bmp slant_b.bmp
    riscv64-unknown-elf-gcc -nostartfiles listing.s -o $@

The slant_a.bmp and slant_b.bmp are the images we draw to the screen. We're going to draw a maze, and these are the two tile pieces we need.

I suggest that when you do SDL2 apps, use the pkg-config. It's a nifty little tool.

Here's the viewer.c, it's not so terrily long that we should leave it away.

/* Opens/Creates a shared file "window"
 * where you can draw stuff using mmap,
 * and it shows on the window that this program created. */
#include <assert.h> /* assert() */
#include <SDL.h>
#include <stdlib.h> /* exit() */
#include <math.h>
#include <unistd.h> /* truncate(), open() */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h> /* perror */
#include <sys/mman.h> /* mmap */

void check_sdl_error(int);

int main() {
    int result;
    SDL_Window *window;
    SDL_Renderer *renderer;
    SDL_Texture *screen;
    SDL_Event event;
    int done = 0;
    char *input_pixels;
    char *output_pixels;
    int pitch;

    int screen_width = 1280;
    int screen_height = 1024;
    int buffer_size = screen_width*screen_height*3;
    int fbfd;

    result = SDL_Init(SDL_INIT_VIDEO | SDL_INIT_AUDIO);
    check_sdl_error(result == 0);

    window = SDL_CreateWindow("viewer", 0, 0, screen_width, screen_height, 0);
    check_sdl_error(window != NULL);

    renderer = SDL_CreateRenderer(window, -1, SDL_RENDERER_ACCELERATED);

    screen = SDL_CreateTexture(renderer,
        SDL_PIXELFORMAT_RGB888,
        SDL_TEXTUREACCESS_STREAMING,
        screen_width, screen_height);

    fbfd = open("window", O_RDWR | O_CREAT, 0660);
    if (fbfd == -1) {
        perror("cannot open 'window'");
        exit(1);
    }

    if (ftruncate(fbfd, buffer_size) == -1) {
        perror("cannot resize the file to correct size");
        exit(1);
    }

    input_pixels = mmap(NULL, buffer_size, PROT_READ|PROT_WRITE, MAP_SHARED, fbfd, 0);
    if (input_pixels == MAP_FAILED) {
        perror("mmap failed");
        exit(1);
    }

    while (!done) {
        while (SDL_PollEvent(&event)) {
            if (event.type == SDL_QUIT) {
                done = 1;
            }
        }

        // An alternative way to do the following,
        // though I felt the simple RGB pixel format was worth the extra effort.
        // SDL_UpdateTexture(screen, NULL, (void*)input_pixels, screen_width*4);

        SDL_LockTexture(screen, NULL, (void**)&output_pixels, &pitch);
        int i = 0;
        for (int y = 0; y < screen_height*pitch; y += pitch) {
            for (int x = 0; x < screen_width*4; x += 4) {
                output_pixels[y+x+0] = input_pixels[i+2];
                output_pixels[y+x+1] = input_pixels[i+1];
                output_pixels[y+x+2] = input_pixels[i+0];
                output_pixels[y+x+3] = 0;
                i += 3;
            }
        }
        SDL_UnlockTexture(screen);

        SDL_RenderClear(renderer);
        SDL_RenderCopy(renderer, screen, NULL, NULL);
        SDL_RenderPresent(renderer);
    }

    SDL_ShowWindow(window);
    SDL_DestroyTexture(screen);
    SDL_DestroyRenderer(renderer);
    SDL_DestroyWindow(window);
    SDL_Quit();
    return 0;
}

void check_sdl_error(int success) {
    if (!success) {
        SDL_LogError(SDL_LOG_CATEGORY_ERROR,
            "%s\n", SDL_GetError());
        exit(1);
    }
}

Note that all of the error checking isn't in place in this C program. There are probably few null pointers that I don't check. Also this thing isn't releasing the resources it allocated during errors. It should be okay as it exits in any case.

Ok, lets start going through the listing.s.

.section .text
.globl _start # ld nags without this.
.org 0
load_address = 0x010000 # can also use: 0x08048000
ehdr:
    .byte 0x7f, 0x45, 0x4c, 0x46, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0
    .half 2                                 #   e_type
    .half 0xf3                              #   e_machine
    .word 1                                 #   e_version
    .quad load_address + _start - ehdr      #   e_entry
    .quad phdr - ehdr                       #   e_phoff
    .quad 0                                 #   e_shoff
    .word 0                                 #   e_flags
    .half ehdrsize                          #   e_ehsize    (64)
    .half phdrsize                          #   e_phentsize (56)
    .half 1                                 #   e_phnum
    .half 0                                 #   e_shentsize (64)
    .half 0                                 #   e_shnum
    .half 0                                 #   e_shstrndx
ehdrsize = . - ehdr

phdr:
    .word 1                                 #   p_type
    .word 5                                 #   p_flags
    .quad 0                                 #   p_offset
    .quad load_address                      #   p_vaddr
    .quad load_address                      #   p_paddr
    .quad filesize                          #   p_filesz
    .quad filesize                          #   p_memsz
    .quad 0x1000                            #   p_align
phdrsize = . - phdr

# Linux system call numbers https://rv8.io/syscalls.html
sys_close = 57
sys_read  = 63
sys_write = 64
sys_exit  = 93
sys_brk     = 214
sys_munmmap = 215
sys_mmap = 222
sys_open = 1024
# Standard fileno numbers
stdin  = 0
stdout = 1
stderr = 2
# Some file flags we need
O_RDWR = 0x0002
PROT_READ  = 0x1
PROT_WRITE = 0x2
PROT_EXEC = 0x4
PROT_READ_WRITE = 0x3
MAP_SHARED  = 0x1
MAP_PRIVATE = 0x2
MAP_FIXED = 0x10
MAP_ANON = 0x20

We have the headers just like before, and there are few more codes for system calls.

The major differenc to the earlier is that we're building a frame into the stack. This frame contains all the things we're juggling around.

_start:
    # The gp frame.
    #  0: fileno
    #  8: screen buffer size
    # 16: screen buffer stride
    # 24: screen pixel width
    # 32: screen pixel height
    # 40: address to framebuffer
    # 48: lcg seed
    # 56: lcg multiplier
    # 64: lcg increment
    # 72: slant_a address (0)
    # 80: slant_a width   (8)
    # 88: slant_a height  (16)
    # 96: slant_b address (0)
    # 104: slant_b width   (8)
    # 112: slant_b height (16)
    # 120
    addi sp, sp, -120
    mv gp, sp

Note we could store all of this in the 32 integer registers we have, but it really made it much harder code to follow than otherwise.

Next we go on, set all those values. The first thing we set up is the framebuffer.

    lla a0, window_file_path
    li  a1, O_RDWR
    li  a2, 0
    li  a7, sys_open
    ecall
    blt a0, zero, program_failure
    sd a0, 0(gp)                            # fileno

    screen_buffer_size = 1280*1024*3
    lui a0,      %hi(screen_buffer_size)
    addi a0, a0, %lo(screen_buffer_size)
    sd a0, 8(gp)                            # screen buffer size

    addi a0, zero, 1280
    addi a0, a0,   1280
    addi a0, a0,   1280
    sd a0, 16(gp)                           # screen buffer stride

    addi a0, zero, 1280
    sd a0, 24(gp)                           # screen pixel width

    addi a0, zero, 1024
    sd a0, 32(gp)                           # screen pixel height

    li a0, 0
    ld a1, 8(gp)                            # screen buffer size
    li a2, PROT_READ_WRITE
    li a3, MAP_SHARED
    ld a4, 0(gp)                            # fileno
    li a5, 0                                # offset
    li a7, sys_mmap
    ecall
    blt a0, zero, program_failure
    sd a0, 40(gp)                           # address to framebuffer

Then we get the real time clock time from the CSR registers and use it to initialize the random number generator. The RNG algorithm used is the linear congruential generator. They used to be popular for the reasons you're going to see soon.

    # Configuration for the random number generator.
    rdtime a0
    lcg_multiplier = 1664525
    lui a1, %hi(lcg_multiplier)
    addi a1, a1, %lo(lcg_multiplier)
    lcg_increment = 1013904223
    lui a2, %hi(lcg_increment)
    addi a2, a2, %lo(lcg_increment)
    sd a0, 48(gp)
    sd a1, 56(gp)
    sd a2, 64(gp)

Next there would be some bitmaps to setup. The bitmaps are included into the binary because why not? They are small images.

    # Extract bitmaps from the images
    # The first two bytes in the bitmap images should be 42 4D
    # Otherwise this code doesn't work.
    lla a0, slant_a_bmp
    addi a1, gp, 72
    jal x1, extract_bitmap

    lla a0, slant_b_bmp
    addi a1, gp, 96
    jal x1, extract_bitmap
    j bitmaps_extracted

The extraction reads the BMP headers, finding the width, height and the offset to the image data.

extract_bitmap:
    lw a2, 10(a0)
    add a2, a2, a0
    lw a3, 18(a0) # width 
    lw a4, 22(a0) # height
    sd a2,  0(a1)
    sd a3,  8(a1)
    sd a4, 16(a1)
    jr x1
bitmaps_extracted:

I didn't get it working straight out when I made this. Especially when I added the frame I originally didn't have, that messed up things and I had to see what I was actually drawing to the screen. It helps if you clear up the screen first, like this!

    # The screen clearing procedure
    ld a0, 40(gp) # framebuffer address
    ld a1,  8(gp) # framebuffer size
    add a1, a1, a0
clear_screen:
    sb x0, 0(a0)
    add a0, a0, 1
    blt a0, a1, clear_screen

Next it's time to start drawing things. This part sets up the drawing loop. We're going to draw tiles.

    li s0, 128 # draw this many rows
    li s1, 80  # draw this many columns
    li s2, 0   # row counter
draw_a_row:
    li s3, 0   # column counter
draw_a_column:
    # Calculating screenbuffer offset to a0
    ld a0, 16(gp) # screen buffer stride
    li a1, 16*3   # tile width*3
    li a2,  8     # tile height
    mul a2, a2, a0
    mul a0, a2, s2
    mul a1, a1, s3
    add a0, a0, a1

Each tile is chosen by a random rice doll.

    # LCG Random number roll
    ld a2, 48(gp) # seed
    ld a3, 56(gp) # multiplier
    ld a4, 64(gp) # increment
    mul a2, a2, a3
    add a2, a2, a4
    sd a2, 48(gp)

    srl a2, a2, 24 # Grasping a bit.
    andi a2, a2, 1

    addi a1, gp, 72
    beq a2, zero, selected_slant_a
    addi a1, gp, 96
selected_slant_a:

Next the actual blitting happens. The image has to be blit row-by-row and each time a row ends, we want to crank the typewriter to the next place to draw. This part calculates the place where move after the line is ready.

    # OK, everything ready for a draw.
    # gp has the framebuffer
    # a0 has the offset
    # a1 has the image
    ld t0, 16(gp) # screen buffer stride
    ld a7, 8(a1)  # image width
    sub t0, t0, a7
    sub t0, t0, a7
    sub t0, t0, a7

Then it has to calculate where the drawing starts from.

    # Not doing overdraw protection so this is fairly simple.
    ld a2, 40(gp)       # framebuffer address
    add a2, a2, a0
    ld a3, 0(a1)        # image address

And finally setup a loop for drawing. The bge -instructions here prevent it going into loop if nothing needs to be done.

    li a4, 0
    ld a5, 16(a1) # image height
    li a6, 0
    ld a7, 8(a1)  # image width
    bge a6, a7, blit_done
    bge a4, a5, blit_done
pixel_blit:

Too late I noticed that the images were in a BGR format and upside down.

    # Pixel blit
    lb a0, 0(a3)
    sb a0, 2(a2)
    lb a0, 1(a3)
    sb a0, 1(a2)
    lb a0, 2(a3)
    sb a0, 0(a2)
    addi a3, a3, 3
    addi a2, a2, 3
    # Loop to the next pixel
    addi a6, a6, 1
    blt a6, a7, pixel_blit
    # Preparing for the next row
    li a6, 0
    add a2, a2, t0
    # Loop to the next row
    addi a4, a4, 1
    blt a4, a5, pixel_blit
blit_done:

Finally everything just needs to be repeated several hundred times and then the program has to stop.

    # Loop to the next column/row
    addi s3, s3, 1
    blt s3, s1, draw_a_column
    addi s2, s2, 1
    blt s2, s0, draw_a_row

    # The drawing has been finished.
    li a0, 0
    li a7, sys_exit
    ecall

program_failure:
    li a0, 1
    li a7, sys_exit
    ecall
    loop: j loop

All the data the program is using is stored in the end. Here's the part that includes the images into the binary and ensures you don't need to load them in as files.

window_file_path:
    .string "window"
    .byte 0

# 16x8x3 images, 54 byte header
slant_a_bmp: .incbin "slant_a.bmp"
slant_b_bmp: .incbin "slant_b.bmp"

filesize = . - ehdr

screenshot

The riscv-maze is also available in github and you can do whatever you want with it.

Writing your own assembler in Python

When you've had enough from the assembly, there's a natural next step you can take. You start writing your own macroassemblers or compilers, of course.

Python sucks in a lot of ways and is gradually getting worse. But although it has peaked, it really doesn't have a proper contender to surpass it yet.

Anyway, here's the mini-hello program written in our diy assembler:

from riscv_asm import *

load_address = 0x010000
_start = Group()
file_end = Group()
ehdr = Group()
phdr = Group()

ehdr.body.extend([
    bytes(0x7f, 0x45, 0x4c, 0x46, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    halfs(2, 0xf3),                         # e_type, e_machine
    words(1),                               # e_version
    quads(_start),                          # e_entry
    quads(Op(operator.sub, phdr, ehdr)),    # e_phoff
    quads(0),                               # e_shoff
    words(0),                               # e_flags
    halfs(64, 56),                          # e_ehsize, e_phentsize
    halfs(1, 0, 0, 0),                      # e_phnum
    check_size(64)])                        # e_shentsize (64)
                                            # e_shnum
                                            # e_shstrndx

file_size = Op(operator.sub, file_end, load_address)
phdr.body.extend([
    words(1, 5),                            # p_type, p_flags
    quads(0),                               # p_offset
    quads(load_address, load_address),      # p_vaddr, p_paddr           
    quads(file_size, file_size),            # p_filesz
    quads(0x1000),                          # p_align
    check_size(56)])                         

# Linux system call numbers https://rv8.io/syscalls.html
sys_write = 64
sys_exit  = 93
# Standard fileno numbers
stdin  = 0
stdout = 1
stderr = 2

greeting_text_str = "Hello world\n"
greeting_text_len = len(greeting_text_str)
greeting_text = Group([string(greeting_text_str)])

success = Group([
    addi(10, 0, 0),
    addi(17, 0, sys_exit),
    ecall ])

failure = Group([
    addi(10, 0, 1),
    addi(17, 0, sys_exit),
    ecall ])

program = Group([
    ehdr,
    phdr,
    _start,
    addi(10, 0, stdout),
    lli(11, greeting_text),
    addi(12, 0, greeting_text_len),
    addi(17, 0, sys_write),
    ecall,
    bne(10, 12, failure),
    success,
    failure,
    greeting_text,
    file_end])

assemble(load_address, program, "hello")

We are using our assembler to create the ELF files as well. It provides all the functionality we need to do that, so I suppose that's okay.

If you're going to use C code along your assembly, I suggest you also write ELF headers for generating object files so that you can feed the output to a linker.

Let's examine at the riscv_asm.py a bit.

"""
    riscv_asm
    ~~~~~~~~~

    A small riscv assembler
"""
from functools import partial
import operator
import os

We're going to implement a small language for generating binaries. It should be unsurprising to see us lift some functions into that language. We're also going to do a bit of currying, as in partially filling functions with parameters. Finally we're using chmod from the os module to change file permissions to executable, after we got a binary.

The assemble routine keeps rewriting the program until labels no longer change their place in the program. The program is also generating a list of errors. These are mainly values that do not fit into the structures. We're only looking at them after the labels have stopped shifting their place.

def assemble(org, program, out_file="a.out"):
    while True:
        block = []
        errors = []
        if program.encode(org, block, None, errors):
            break

    if len(errors) > 0:
        for fmt, info in errors:
            print fmt.format(*info)
    else:
        with open(out_file, "wb") as fd:
            fd.write("".join(block))
        os.chmod(out_file, 0755)

Groups are blocks, labels or sections in the program. Theoretically we could replace the "self.org" with a "self.offset", but there are potentially other problems to fix before that.

Encoders are wrapped functions that pass in additional values after the state.

class Group(object):
    def __init__(self, body=None):
        self.body = [] if body is None else body
        self.org = 0

    def encode(self, org, block, group, errors):
        ready = (self.org == org + len(block))
        self.org = org + len(block)
        for encoder in self.body:
            subready = encoder.encode(org, block, self, errors)
            ready = ready and subready
        return ready

class Encoder(object):
    def __init__(self, fn, values):
        self.fn = fn
        self.values = values

    def encode(self, org, block, group, errors):
        return self.fn(org, block, group, errors, self.values)

The generative parts of the assembler are encoded into the above structures. We also have plain computational parts for addresses. They are encoded using the Op -construct and resolved using the resolve(org, value) -command.

The here is a constant that gives the current origin to the encoder.

class Op(object):
    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args

def resolve(org, arg):
    if isinstance(arg, Group):
        return arg.org
    elif isinstance(arg, Op):
        return arg.fn(*(resolve(org, arg) for arg in arg.args))
    elif arg is here:
        return org
    else:
        return arg

here = object()

The simple encoders take care of inserting data into the code.

def bytes(*args):
    return Encoder(enc_bytes, args)

def halfs(*args):
    return Encoder(enc_halfs, args)

def words(*args):
    return Encoder(enc_words, args)

def quads(*args):
    return Encoder(enc_quads, args)

def string(arg):
    return Encoder(enc_string, arg)

Then we have a tool that lets us check the size of a structure. So that we see it matches with what we expect.

def check_size(size):
    return Encoder(enc_check_size, size)

This check routine creates entries into the error list if the condition doesn't satisfy.

def check(errors, condition, fmt, *info):
    if not condition:
        errors.append((fmt, info))

Then come the actual generators for the above parts.

def enc_bytes(org, block, group, errors, args):
    for a in args:
        a = resolve(org + len(block), a)
        check(errors, 0 <= a <= 0xFF, "bytes {}", hex(a))
        block.append(chr(a))
    return True

def enc_halfs(org, block, group, errors, args):
    for a in args:
        a = resolve(org + len(block), a)
        check(errors, 0 <= a <= 0xFFFF, "halfs {}", hex(a))
        block.append(chr((a >> 0) & 255))
        block.append(chr((a >> 8) & 255))
    return True

def enc_words(org, block, group, errors, args):
    for a in args:
        a = resolve(org + len(block), a)
        check(errors, 0 <= a <= 0xFFFFFFFF, "words {}", hex(a))
        block.append(chr((a >>  0) & 255))
        block.append(chr((a >>  8) & 255))
        block.append(chr((a >> 16) & 255))
        block.append(chr((a >> 24) & 255))
    return True

def enc_quads(org, block, group, errors, args):
    for a in args:
        a = resolve(org + len(block), a)
        check(errors, 0 <= a <= 0xFFFFFFFFFFFFFFFF, "quads {}", hex(a))
        block.append(chr((a >>  0) & 255))
        block.append(chr((a >>  8) & 255))
        block.append(chr((a >> 16) & 255))
        block.append(chr((a >> 24) & 255))
        block.append(chr((a >> 32) & 255))
        block.append(chr((a >> 40) & 255))
        block.append(chr((a >> 48) & 255))
        block.append(chr((a >> 56) & 255))
    return True

def enc_string(org, block, group, errors, string):
    block.extend(string.encode('utf-8'))
    return True

def enc_check_size(org, block, group, errors, size):
    real_sz = org + len(block) - resolve(org + len(block), group)
    sz = resolve(org + len(block), size)
    check(errors, real_sz == sz, "size mismatch {} != {}", real_sz, sz)
    return True

For your own good, check the input ranges and treat signed/unsigned quantities as different types.

Next we got a fairly simple tool that we're using a little in building the thing, but not actually otherwise.

# Can be used to gauge if you got it right.
def str_template(template):
    vector = bin(template)[2:34].rjust(32, '0')
    return " ".join([
        vector[0:7], vector[7:12], vector[12:17],
        vector[17:20], vector[20:25], vector[25:]])
    return vector

To show you what this is doing.

>>> str_template(0)
'0000000 00000 00000 000 00000 0000000'

This is the "frame" that 32-bit long RISC-V instructions use. We can use it to check whether we've provided the right encoding. For example, the BGEU should be 0x7063 and it's using the B-type filling.

'0000000 00000 00000 111 00000 1100011'

The instruction frames themselves follow the form. For example, here's the J-type filling.

'_______ _____ _____ ___ XXXXX _______' rd register
'X______ _____ _____ ___ _____ _______' imm bit 20
'_XXXXXX XXXX_ _____ ___ _____ _______' imm bits 10:1
'_______ ____X _____ ___ _____ _______' imm bits 11
'_______ _____ XXXXX XXX _____ _______' imm bits 19:12

What are the shifting offsets and masks?

      25    20    15  12     7       0
'_______ _____ _____ ___ _____ XXXXXXX' 0x0000007F
'_______ _____ _____ ___ XXXXX _______' 0x00000F80
'_______ _____ _____ XXX _____ _______' 0x00007000
'_______ _____ XXXXX ___ _____ _______' 0x000F8000
'_______ XXXXX _____ ___ _____ _______' 0x01F00000
'XXXXXXX _____ _____ ___ _____ _______' 0xFE000000

So we can use this to construct the correct encoding for J-type filling and see that it holds:

>>> imm = 0x1FFFFF
>>> imm_20    = (imm << 11) & 0x80000000
>>> imm_10_1  = (imm << 20) & 0x7FE00000
>>> imm_11    = (imm << 9)  & 0x00100000
>>> imm_19_12 =  imm        & 0x000FF000
>>> str_template(imm_20)
'1000000 00000 00000 000 00000 0000000'
>>> str_template(imm_10_1)
'0111111 11110 00000 000 00000 0000000'
>>> str_template(imm_11)
'0000000 00001 00000 000 00000 0000000'
>>> str_template(imm_19_12)
'0000000 00000 11111 111 00000 0000000'

Just check that the sign bit gets on the right place.

Here's the program that encodes the RISC-V instructions. Don't trust the encodings. I may have made a mistake.

class I32Enc(object):
    def __init__(self, template, fn, *values):
        self.template = template
        self.fn = fn
        self.values = values

    def encode(self, org, block, group, errors):
        a = self.template | self.fn(org, block, group, errors, self.values)
        block.append(chr((a >>  0) & 255))
        block.append(chr((a >>  8) & 255))
        block.append(chr((a >> 16) & 255))
        block.append(chr((a >> 24) & 255))
        return True

def r_type(org, block, group, errors, (rd, rs1, rs2)):
    assert 0 <= rd < 32
    assert 0 <= rs1 < 32
    assert 0 <= rs2 < 32
    return (rd << 7) | (rs1 << 15) | (rs2 << 20)

def fence_type(org, block, group, (prec, succ)):
    imm = (prec & 0xF) << 4 | (succ & 0xF)
    return i_type(org, block, group, (0, 0, imm))

def i_type(org, block, group, errors, (rd, rs1, imm)):
    assert 0 <= rd < 32
    assert 0 <= rs1 < 32
    imm = resolve(org + len(block), imm)
    check(errors, -2048 <= imm < 2048, "imm field (i_type) {}", imm)
    imm_11_0 = (imm << 20) & 0xFFF00000
    return imm_11_0 | (rd << 7) | (rs1 << 15)

def s_type(org, block, group, errors, (rs1, rs2, imm)):
    assert 0 <= rs1 < 32
    assert 0 <= rs2 < 32
    imm = resolve(org + len(block), imm)
    check(errors, -2048 <= imm < 2048, "imm field (s_type) {}", imm)
    imm_11_5 = (imm << 20) & 0x7FF00000
    imm_4_0  = (imm <<  7) & 0x00000F80
    return imm_11_5 | imm_4_0 | (rs1 << 15) | (rs2 << 20)

def b_type(org, block, group, errors, (rs1, rs2, imm)):
    assert 0 <= rs1 < 32
    assert 0 <= rs2 < 32
    imm = resolve(org + len(block), imm) - (org + len(block))
    check(errors, -4096 <= imm < 4096, "imm field (b_type) {}", imm)
    assert imm & 1 == 0
    imm_12   = (imm << 19) & 0x80000000
    imm_10_5 = (imm << 15) & 0x7F000000
    imm_4_1  = (imm <<  7) & 0x00000F00
    imm_11   = (imm >>  4) & 0x00000080
    return imm_12 | imm_10_5 | imm_4_1 | imm_11 | (rs1 << 15) | (rs2 << 20)

def u_type(org, block, group, errors, (rd, imm)):
    assert 0 <= rd < 32
    imm = resolve(org + len(block), imm)
    assert imm & 0xFFF == 0
    return imm | (rd << 7)

def j_type(org, block, group, errors, (rd, imm)):
    assert 0 <= rd < 32
    imm = resolve(org + len(block), imm) - (org + len(block))
    check(errors, -1048576 <= imm < 1048576, "imm field (j_type) {}", imm)
    assert imm & 1 == 0
    imm_20    = (imm << 11) & 0x80000000
    imm_10_1  = (imm << 20) & 0x7FE00000
    imm_11    = (imm <<  9) & 0x00100000
    imm_19_12 =  imm        & 0x000FF000
    return imm_20 | imm_10_1 | imm_11 | imm_19_12 | (rd << 7)

Finally we provide the hi/lo -macros for splitting large values into lui/auipc and addi -combinations.

def calc_hi(value):
    if value > 0 and value & 2048 != 0:
        value = value + 4096
    if value < 0 and value & 2048 != 0:
        value = value + 4096
    return value ^ (value & 4095)

def calc_lo(value):
    k = value & 4095
    if k >= 2048:
        return ~(~k & 4095)
    return k

hi = partial(Op, calc_hi)
lo = partial(Op, calc_lo)

# Self-check for hi/lo calculators.
for x in range(-10000, 10000):
    assert calc_hi(x) + calc_lo(x) == x

def lli(rd, imm):
    rel = Group()
    offset = Op(operator.sub, imm, rel)
    rel.body.extend([
        auipc(rd, hi(offset)),
        addi(rd, rd, lo(offset)) ])
    return rel

Of course we do nothing with these without some encoding tables. Don't trust these either!

lui   = partial(I32Enc, 0x37, u_type)
auipc = partial(I32Enc, 0x17, u_type)
jal  = partial(I32Enc, 0x6F, j_type)
jalr = partial(I32Enc, 0x67, i_type)
beq  = partial(I32Enc, 0x0063, b_type)
bne  = partial(I32Enc, 0x1063, b_type)
blt  = partial(I32Enc, 0x4063, b_type)
bge  = partial(I32Enc, 0x5063, b_type)
bltu = partial(I32Enc, 0x6063, b_type)
bgeu = partial(I32Enc, 0x7063, b_type)
lb  = partial(I32Enc, 0x0003, i_type)
lh  = partial(I32Enc, 0x1003, i_type)
lw  = partial(I32Enc, 0x2003, i_type)
lbu = partial(I32Enc, 0x4003, i_type)
lhu = partial(I32Enc, 0x5003, i_type)
sb = partial(I32Enc, 0x0023, s_type)
sh = partial(I32Enc, 0x1023, s_type)
sw = partial(I32Enc, 0x2023, s_type)
addi  = partial(I32Enc, 0x0013, i_type)
slti  = partial(I32Enc, 0x2013, i_type)
sltiu = partial(I32Enc, 0x3013, i_type)
xori  = partial(I32Enc, 0x4013, i_type)
ori   = partial(I32Enc, 0x6013, i_type)
andi  = partial(I32Enc, 0x7013, i_type)

slli  = partial(I32Enc, 0x00001013, r_type)
srli  = partial(I32Enc, 0x00005013, r_type)
srai  = partial(I32Enc, 0x40005013, r_type)
add   = partial(I32Enc, 0x00000033, r_type)
sub   = partial(I32Enc, 0x40000033, r_type)
sll   = partial(I32Enc, 0x00001033, r_type)
slt   = partial(I32Enc, 0x00002033, r_type)
sltu  = partial(I32Enc, 0x00003033, r_type)
xor   = partial(I32Enc, 0x00004033, r_type)
srl   = partial(I32Enc, 0x00005033, r_type)
sra   = partial(I32Enc, 0x40005033, r_type)
or_   = partial(I32Enc, 0x00006033, r_type)
and_  = partial(I32Enc, 0x00007033, r_type)
fence   = partial(I32Enc, 0x000F, fence_type)
fence_i = words(0x100F)
ecall  = words(0x000073)
ebreak = words(0x100073)

csrrw  = partial(I32Enc, 0x1073, i_type)
csrrs  = partial(I32Enc, 0x2073, i_type)
csrrc  = partial(I32Enc, 0x3073, i_type)
csrrwi = partial(I32Enc, 0x5073, i_type)
csrrsi = partial(I32Enc, 0x6073, i_type)
csrrci = partial(I32Enc, 0x7073, i_type)

That's the whole thing for now. In practice you probably want to distinguish between register and constant arguments. In some instructions we should give them additional annotations because semantics. For example, in the SLLI, SRLI, SRAI instructions rs2 is shift-amount. The position in the argument list doesn't change, but the type is entirely different.

Here's still a little treat you may appreciate if you try to examine the bit fiddling in this program.

If you don't happen to remember how to convert between binary and hexadecimals. It should be easy by dividing the problem into quadrants:

     00xx 01xx 10xx 11xx
xx00 0    4    8    C
xx01 1    5    9    D
xx10 2    6    A    E
xx11 3    7    B    F

Trace left to obtain the lower two digits and up to obtain the higher two digits for the hexadecimal digit.

So if you were to look at what binary a hexadecimal refers to, go "zero, four, eight, C" to first increment left side, then increment by one until you get to the hexadecimal digit you have.

Likewise, you can examine two first digits of a 4-group binary and then the second two digits to table the hexadecimal digit for yourself.

To negate a number, you flip the bits around and increment it by one. Eg. ~x + 1.

The riscv-diy-asmcc contains the assembler presented here. If you want the exact source code for the code presented here, it's under the tag 2019_feb_25_blogpost.

Writing your own compiler in Python

There's a paper "Near-Optimal Instruction Selection on DAGs" that I've read and implemented at least once.

Linear register allocation or Chaitin-Briggs algorithm is probably needed as well.

There's also still the neat paper by Jorge Navas, "Horn-Clauses as an Intermediate Representation for Program Analysis". that I found interesting as a compiler writer.

The blog post is starting to be 7000 words long by now. I guess these are subjects for the another time.

Machine-readable specification

I absolutely love machine-readable specifications. Compilers are the first places where there should be lots and lots of them. There shouldn't really exist any single system software you can't or shouldn't be able to reimplement yourself with reasonable time and knowledge.

Projects such as implementing your own compiler and toolchain doesn't need to be just hobby projects, thanks to specifications that can be read by a machine. I did few useful tools to cater my previous programming language that anyone possibly wanting to write this kind of stuff might like to know.

The vkstruct provide some python scripts and macros for examining Vulkan graphics API's XML-formatted machine specs. It produces a json file that I used to implement primitive wrappers in Lever for Vulkan spec.
The cffi-gen is a set of tools to parse and examine C headers. Idea is that you write a Python script that implements few rewriting rules to produce a clean json-formatted header file that you can use to drive an FFI.

These haven't been very popular projects because it turns out lots of people love doing repetitive tasks and something like knitting but with a computer. Everybody loves knitting socks but few love making electronic knitting machines that make socks. Too bad those handmade SDL2 bindings do not warm your feet, yet they stink and are buggy!

Hardware to buy

When you finally get something fun done, you'd probably want hardware for it. There's sifive's hifive arduino board, the chinese sipeed/kendryte. NXP is making boards.

I guess I'll buy some board at some point, would prefer one with 5V GPIO and ADC ports. Doesn't harm if it has video output built in. If you wonder what I'm going to do, well.. It suffices to say I've been watching 8bitguy. He's got some real influence on people. Also been Arduino programming variety of things.

Who would like to follow this tutorial?

It's possible you just read it through for fun. But is your own compiler/assembler writing possibly something more than just a fun exercises? I don't know.

This opens up possibilities and ways to do things that have been closed off from most people for three decades now. Here I just exclaim you that "Hey! You can do this again!".

If you ask me, I didn't want to end up having to use common tools because I cannot make my own. I think with GCC and LLVM it's been steering to that direction for a while. I want to be in an situation that I use common tools because they're so good that there's no need to make better myself.