RISC-V can be fun if they don't mess it up
Would you like to code on a fun assembly platform that's not from your past but from your future and has more than two registers? Would you like to not buy a new computer to try it out? I do! In this post I show how to try it out as well.
Lots of people are right when it comes to assembly languages. The complexity of the instruction set would not matter today. At least if you had it documented properly! Intel documentation consists of a 600 page PDF. Without machine-readable files describing the operational semantics of each instruction, you're looking into at least 2 years of work to support Intel architecture in its full glory. Except that projects such as google's EXEgesis have made a machine reader for Intel manuals. Unfortunately the information needed to support the Intel as a compiler target did not exist in the manuals.
Intel lacks the documentation for a compiler writer to support their platform. I guess the average response to this would be "We need no stinking compiler writers! We already have LLVM." I've never figured out what to answer to those guys. Push your pillow to the head, it's not my problem.
Eventually I have ceased all of my attempts to build low level code generators or JIT compilers from scratch. I decided to not mess with x86-64 architecture or instruction set anymore. Intel doesn't deserve free high-intensity support from hobbyists all over the world. They're just as piece of crap as the rest of the computing industry is. Therefore they all deserve the same amount of effort for support.
Fast forward to today, now there's a RISC-V simulator rv8 that works on x86-64. Most of it seems to be work of Michael Clark, who also worked on RISC-V stuff for qemu.
I tried RISC-V in this simulator and read through the ISA specification. It's in lots of ways the easiest platform I've used.
I don't repeat the build instructions for rv8 because they got those in the README.md in their github. Riscv gnu toolchain takes hours to install and rv8 takes few minutes on top of that.
If you're going to spend lot of time with this,
you probably want to create directory riscv and put initscript.sh there:
# Invoke this with 'source initscript.sh' to quickly setup build env paths.export RISCV=/path/to/riscv/toolchainexport RV8=/path/to/rv8export PATH=$PATH:$RISCV/bin:$RV8/bin
Replace the /path/to -things with places where you got the riscv toolchain
and rv8 installed.
"Raw" hello world
Next I'm going to show you a hello world that runs in the rv-jit. This is not the usual "hello", but one that's suited for people wanting to continue writing their own compiler backends.
We aren't using the linker to create the ELF executable file, instead we produce it using the assembly code. ELF is a complex weird file because Unix System Laboratories never invented directories. As long as you don't care about all the ELF details, you should be fine using the ELF file format. Remember to never read the ELF spec or feel a need for that, and you're doing fine!
We're using a bit unusual rules to compile the program.
Here's the Makefile for you, you know what to do with it:
# There was a bug in the riscv toolchain preventing use of --oformat binary.# Use of objcopy is a workaround for it.mini-hello: mini-hello.elfriscv64-unknown-elf-objcopy -O binary $^ $@chmod +x $@mini-hello.elf: listing.sriscv64-unknown-elf-gcc -nostartfiles $^ -o $@
After make, you can produce the finished program with rv-jit mini-hello.
Next I provide the listing.s in multiple parts.
.section .text.globl _start # ld nags without this..org 0load_address = 0x010000 # can also use: 0x08048000
The weird .org 0 and load_address-hack is required because we link
an executable by producing an .elf -file
and then copy it out from its outer container.
This is the ELF file header. There's a bit of bytes in the header.
I suggest using the readelf utility if you wonder what all of this means.
Mostly it identifies what this file is, what machine it is and such.
ehdr:.byte 0x7f, 0x45, 0x4c, 0x46, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0.half 2 # e_type.half 0xf3 # e_machine.word 1 # e_version.quad load_address + _start - ehdr # e_entry.quad phdr - ehdr # e_phoff.quad 0 # e_shoff.word 0 # e_flags.half ehdrsize # e_ehsize (64).half phdrsize # e_phentsize (56).half 1 # e_phnum.half 0 # e_shentsize (64).half 0 # e_shnum.half 0 # e_shstrndxehdrsize = . - ehdr
The last line calculates the size of the file header, so we won't need to write "56" anywhere. We're doing that here and there.
The two main parts of the ELF file are program and section headers. You don't need section headers in a loadable file like this. The program headers are read by the program loader in the Linux Kernel, probably in the order they are put in. Headers are fixed size entities with immensely important information on them that we mostly don't care about.
The only one that you need for now is a LOAD header.
phdr:.word 1 # p_type.word 5 # p_flags.quad 0 # p_offset.quad load_address # p_vaddr.quad load_address # p_paddr.quad filesize # p_filesz.quad filesize # p_memsz.quad 0x1000 # p_alignphdrsize = . - phdr
If you wonder what this all means,
riscv64-unknown-elf-readelf -l mini-hello tells that to you.
Here's the output from that:
Elf file type is EXEC (Executable file)Entry point 0x10078There is 1 program header, starting at offset 64Program Headers:Type Offset VirtAddr PhysAddrFileSiz MemSiz Flags AlignLOAD 0x0000000000000000 0x0000000000010000 0x00000000000100000x00000000000000b7 0x00000000000000b7 R E 0x1000
Note the first three bytes in p_flags follow the rwx -file convention.
They tell how the memory is protected.
Remember that in modern systems
you can't have write and execute byte enabled at the same time.
This also applies to mmap and probably even to chmod -command.
The physical and file-address refer to the idea that the file image and
memory image it represents are two different things.
If you want empty space for something,
you provide no file offset or size
and only describe virtual address offset and size.
In practice though, you don't want that as mmap
can be called from the assembly code rather easily.
Alignment tells how the things are aligned in files and memory.
If you need to pad things, remember to use zero padding
and not the sensitive important secret documents
the popular word processor used to use for padding in .doc files.
If you're really interested,
you may also use the -h -flag to print the file header.
Next there's bit of magic numbers. Assembly programming produces lot of magic numbers and I suggest you name all of them and write down what they mean.
# Linux system call numbers https://rv8.io/syscalls.htmlsys_write = 64sys_exit = 93# Standard fileno numbersstdin = 0stdout = 1stderr = 2
Finally here's the program itself and the much-needed ending.
_start:li a0, stdoutlla a1, greeting_textlw a2, -4(a1)li a7, sys_writeecallbne a0, a2, write_failedli a0, 0li a7, sys_exitecallwrite_failed:li a0, 1li a7, sys_exitecall.word greeting_end - greeting_textgreeting_text:.string "Hello world\n"greeting_end:filesize = . - ehdr
Note I'm being a bit fancy here and catching the write failure.
The riscv-mini-hello is available for you to tinker in github. It's even got a license so you can make a product based on minimally greeting people with a computer.
Short introduction to the ISA
If you've been following very carefully, you may have been noticed that the RISC-V ISA doesn't have instructions that encode to zero. This is very helpful if your program escapes.
Something that's harder to notice is that position-independent code in RISC-V is so easy to write that you're not really wanting to write any other kind of code in it.
The instruction set is large and can grow into any direction, but all the instructions are grouped into categories. Programs flag which instructions they are using. It was either the ehdr or phdr where you mark it, you'll figure it out quickly (Actually didn't find it myself).
RV32I base instruction set provides you with the following:
- Load long address / PC relative address
- Jump and link (JAL)
- Jump and link register (JALR)
- Compare two registers and branch (BEQ,BNE,BLT,BGE,BLTU,BGEU)
- Load/Save (LB,LH,LW,LBU,LHU,SB,SH,SW)
- Add/Move (ADDI)
- Set 1/0 if register less than the immediate value (SLTI,SLTIU)
- Bunch of arithmetic (XORI,ORI,ANDI,SLLI,SRLI,SRAI,ADD,SUB,SLL,SLT,SLTU,XOR,SRL,SRA,OR,AND)
- Fences (FENCE, FENCE.I)
- System calls (ECALL)
- Break to debugger (EBREAK)
- System status register transfers (CSRRW,CSRRS,CSRRC,CSRRWI,CSRRSI,CSRRCI)
There are only 6 different encodings for these. The chicanery with the sign flag in the immediate field has been left at minimum. You can read more from the User-Level ISA spec about these instructions. It's very easy to find.
Also most of these basic things are very easy to understand. There's not much guesswork involved with the semantics of the instructions. Especially since there's no cflags register to confuse hardware designers.
Well for the FENCE you probably have to read the ISA manual.
It's actually quite important as you need it for ordering memory write/reads
between RISC-V "threads" harts, and for external I/O.
This actually seemed very, very important so I gave it more look.
We need a memory model to give everyone a headache and ensure that there doesn't need to be communication between harts when there is none. I'm not quite sure about this, but there seem to be still work being done on this.
It'd seem every hart sees every memory store operation it did previously, but in the global scenery these harts see different order between operations. We should be thinking there's a global memory order that is total ordering of the memory operations produced by all harts, and it's not the same as what an individual hart sees in its little mind. Fences constrain the global memory order in respect to the hart.
The fence rw,rw is your full memory barrier.
It means that no memory operations from before
switch place with any memory operations after.
Basically, the total memory order contribution
of this hart must satisfy this rule.
The fence instruction consists of a predecessor and successor set flags.
The predecessor refers to the set of operations preceding the fence.
For example, the r would refer to all of local hart's operations
that read to the memory.
Lets take an example.
sw x0, 0(s0)fence w,rlw x10, 0(s1)
With the fence, other harts see that this hart writes zero to s0
before reading value from s1.
In an another, extreme case, we might want to configure an audio chip before we let it play a note. We would do a output-output fence to ensure the device sees config being written before notes are sent.
Fences and memory models are something we all have to take seriously on future concurrently running systems. I hope everybody does.
The RV32I instructions are all 32-bit long, but there's also a way to get 16-bit instructions, 64-bit and 128-bit instructions. They've been forward thinking and the encoding of these do not collide with the existing 32-bit instructions, so they do not corrupt the program if mixed together in mistake.
There are 32 integer registers and the first register is always zero even if you wrote something to it.
The calling conventions for those registers are a bit difficult to remember. The following bitmasks describe them if you ever need to know it:
neither_save_registers = 0x19caller_save_registers = 0xF003FCE2callee_save_registers = 0x0FFC030428 24 20 16 12 8 4 00000 0000 0000 0000 0000 0000 0001 1001 neither saves1111 0000 0000 0011 1111 1100 1110 0010 caller saves0000 1111 1111 1100 0000 0011 0000 0100 callee saves
If you manage to remember the callee-save register bitmask
(faulty faulty caca oh three oh four),
you're probably doing fine as also the
three and four registers are the neither-save registers.
Drawing graphics in RV8-JIT
Before we get away from writing assembly code with an assembler. I'd like to show you few other things. I'm sure you're yearning to draw some graphics on the screen and I think it shouldn't be the hardest thing in the world to do.
The RV8-jit quite doesn't have everything to do it without help,
but we can build a small SDL2 program that does most of the work.
Lets start with the Makefile again.
SDL2_CFLAGS = $(shell pkg-config sdl2 --cflags)SDL2_LIBS = $(shell pkg-config sdl2 --libs)CFLAGS = $(SDL2_CFLAGS)LIBS = $(SDL2_LIBS).PHONY: allall: viewer mazeviewer: viewer.cgcc $(CFLAGS) -O2 $^ -o $@ $(LIBS)# There was a bug in the riscv toolchain preventing use of --oformat binary.# Use of objcopy is a workaround for it.maze: maze.elfriscv64-unknown-elf-objcopy -O binary $^ $@chmod +x $@maze.elf: listing.s slant_a.bmp slant_b.bmpriscv64-unknown-elf-gcc -nostartfiles listing.s -o $@
The slant_a.bmp and slant_b.bmp are the images we draw to the screen.
We're going to draw a maze, and these are the two tile pieces we need.
I suggest that when you do SDL2 apps, use the pkg-config.
It's a nifty little tool.
Here's the viewer.c, it's not so terrily long that we should leave it away.
/* Opens/Creates a shared file "window"* where you can draw stuff using mmap,* and it shows on the window that this program created. */#include <assert.h> /* assert() */#include <SDL.h>#include <stdlib.h> /* exit() */#include <math.h>#include <unistd.h> /* truncate(), open() */#include <sys/types.h>#include <sys/stat.h>#include <fcntl.h>#include <stdio.h> /* perror */#include <sys/mman.h> /* mmap */void check_sdl_error(int);int main() {int result;SDL_Window *window;SDL_Renderer *renderer;SDL_Texture *screen;SDL_Event event;int done = 0;char *input_pixels;char *output_pixels;int pitch;int screen_width = 1280;int screen_height = 1024;int buffer_size = screen_width*screen_height*3;int fbfd;result = SDL_Init(SDL_INIT_VIDEO | SDL_INIT_AUDIO);check_sdl_error(result == 0);window = SDL_CreateWindow("viewer", 0, 0, screen_width, screen_height, 0);check_sdl_error(window != NULL);renderer = SDL_CreateRenderer(window, -1, SDL_RENDERER_ACCELERATED);screen = SDL_CreateTexture(renderer,SDL_PIXELFORMAT_RGB888,SDL_TEXTUREACCESS_STREAMING,screen_width, screen_height);fbfd = open("window", O_RDWR | O_CREAT, 0660);if (fbfd == -1) {perror("cannot open 'window'");exit(1);}if (ftruncate(fbfd, buffer_size) == -1) {perror("cannot resize the file to correct size");exit(1);}input_pixels = mmap(NULL, buffer_size, PROT_READ|PROT_WRITE, MAP_SHARED, fbfd, 0);if (input_pixels == MAP_FAILED) {perror("mmap failed");exit(1);}while (!done) {while (SDL_PollEvent(&event)) {if (event.type == SDL_QUIT) {done = 1;}}// An alternative way to do the following,// though I felt the simple RGB pixel format was worth the extra effort.// SDL_UpdateTexture(screen, NULL, (void*)input_pixels, screen_width*4);SDL_LockTexture(screen, NULL, (void**)&output_pixels, &pitch);int i = 0;for (int y = 0; y < screen_height*pitch; y += pitch) {for (int x = 0; x < screen_width*4; x += 4) {output_pixels[y+x+0] = input_pixels[i+2];output_pixels[y+x+1] = input_pixels[i+1];output_pixels[y+x+2] = input_pixels[i+0];output_pixels[y+x+3] = 0;i += 3;}}SDL_UnlockTexture(screen);SDL_RenderClear(renderer);SDL_RenderCopy(renderer, screen, NULL, NULL);SDL_RenderPresent(renderer);}SDL_ShowWindow(window);SDL_DestroyTexture(screen);SDL_DestroyRenderer(renderer);SDL_DestroyWindow(window);SDL_Quit();return 0;}void check_sdl_error(int success) {if (!success) {SDL_LogError(SDL_LOG_CATEGORY_ERROR,"%s\n", SDL_GetError());exit(1);}}
Note that all of the error checking isn't in place in this C program. There are probably few null pointers that I don't check. Also this thing isn't releasing the resources it allocated during errors. It should be okay as it exits in any case.
Ok, lets start going through the listing.s.
.section .text.globl _start # ld nags without this..org 0load_address = 0x010000 # can also use: 0x08048000ehdr:.byte 0x7f, 0x45, 0x4c, 0x46, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0.half 2 # e_type.half 0xf3 # e_machine.word 1 # e_version.quad load_address + _start - ehdr # e_entry.quad phdr - ehdr # e_phoff.quad 0 # e_shoff.word 0 # e_flags.half ehdrsize # e_ehsize (64).half phdrsize # e_phentsize (56).half 1 # e_phnum.half 0 # e_shentsize (64).half 0 # e_shnum.half 0 # e_shstrndxehdrsize = . - ehdrphdr:.word 1 # p_type.word 5 # p_flags.quad 0 # p_offset.quad load_address # p_vaddr.quad load_address # p_paddr.quad filesize # p_filesz.quad filesize # p_memsz.quad 0x1000 # p_alignphdrsize = . - phdr# Linux system call numbers https://rv8.io/syscalls.htmlsys_close = 57sys_read = 63sys_write = 64sys_exit = 93sys_brk = 214sys_munmmap = 215sys_mmap = 222sys_open = 1024# Standard fileno numbersstdin = 0stdout = 1stderr = 2# Some file flags we needO_RDWR = 0x0002PROT_READ = 0x1PROT_WRITE = 0x2PROT_EXEC = 0x4PROT_READ_WRITE = 0x3MAP_SHARED = 0x1MAP_PRIVATE = 0x2MAP_FIXED = 0x10MAP_ANON = 0x20
We have the headers just like before, and there are few more codes for system calls.
The major differenc to the earlier is that we're building a frame into the stack. This frame contains all the things we're juggling around.
_start:# The gp frame.# 0: fileno# 8: screen buffer size# 16: screen buffer stride# 24: screen pixel width# 32: screen pixel height# 40: address to framebuffer# 48: lcg seed# 56: lcg multiplier# 64: lcg increment# 72: slant_a address (0)# 80: slant_a width (8)# 88: slant_a height (16)# 96: slant_b address (0)# 104: slant_b width (8)# 112: slant_b height (16)# 120addi sp, sp, -120mv gp, sp
Note we could store all of this in the 32 integer registers we have, but it really made it much harder code to follow than otherwise.
Next we go on, set all those values. The first thing we set up is the framebuffer.
lla a0, window_file_pathli a1, O_RDWRli a2, 0li a7, sys_openecallblt a0, zero, program_failuresd a0, 0(gp) # filenoscreen_buffer_size = 1280*1024*3lui a0, %hi(screen_buffer_size)addi a0, a0, %lo(screen_buffer_size)sd a0, 8(gp) # screen buffer sizeaddi a0, zero, 1280addi a0, a0, 1280addi a0, a0, 1280sd a0, 16(gp) # screen buffer strideaddi a0, zero, 1280sd a0, 24(gp) # screen pixel widthaddi a0, zero, 1024sd a0, 32(gp) # screen pixel heightli a0, 0ld a1, 8(gp) # screen buffer sizeli a2, PROT_READ_WRITEli a3, MAP_SHAREDld a4, 0(gp) # filenoli a5, 0 # offsetli a7, sys_mmapecallblt a0, zero, program_failuresd a0, 40(gp) # address to framebuffer
Then we get the real time clock time from the CSR registers and use it to initialize the random number generator. The RNG algorithm used is the linear congruential generator. They used to be popular for the reasons you're going to see soon.
# Configuration for the random number generator.rdtime a0lcg_multiplier = 1664525lui a1, %hi(lcg_multiplier)addi a1, a1, %lo(lcg_multiplier)lcg_increment = 1013904223lui a2, %hi(lcg_increment)addi a2, a2, %lo(lcg_increment)sd a0, 48(gp)sd a1, 56(gp)sd a2, 64(gp)
Next there would be some bitmaps to setup. The bitmaps are included into the binary because why not? They are small images.
# Extract bitmaps from the images# The first two bytes in the bitmap images should be 42 4D# Otherwise this code doesn't work.lla a0, slant_a_bmpaddi a1, gp, 72jal x1, extract_bitmaplla a0, slant_b_bmpaddi a1, gp, 96jal x1, extract_bitmapj bitmaps_extracted
The extraction reads the BMP headers, finding the width, height and the offset to the image data.
extract_bitmap:lw a2, 10(a0)add a2, a2, a0lw a3, 18(a0) # widthlw a4, 22(a0) # heightsd a2, 0(a1)sd a3, 8(a1)sd a4, 16(a1)jr x1bitmaps_extracted:
I didn't get it working straight out when I made this. Especially when I added the frame I originally didn't have, that messed up things and I had to see what I was actually drawing to the screen. It helps if you clear up the screen first, like this!
# The screen clearing procedureld a0, 40(gp) # framebuffer addressld a1, 8(gp) # framebuffer sizeadd a1, a1, a0clear_screen:sb x0, 0(a0)add a0, a0, 1blt a0, a1, clear_screen
Next it's time to start drawing things. This part sets up the drawing loop. We're going to draw tiles.
li s0, 128 # draw this many rowsli s1, 80 # draw this many columnsli s2, 0 # row counterdraw_a_row:li s3, 0 # column counterdraw_a_column:# Calculating screenbuffer offset to a0ld a0, 16(gp) # screen buffer strideli a1, 16*3 # tile width*3li a2, 8 # tile heightmul a2, a2, a0mul a0, a2, s2mul a1, a1, s3add a0, a0, a1
Each tile is chosen by a random rice doll.
# LCG Random number rollld a2, 48(gp) # seedld a3, 56(gp) # multiplierld a4, 64(gp) # incrementmul a2, a2, a3add a2, a2, a4sd a2, 48(gp)srl a2, a2, 24 # Grasping a bit.andi a2, a2, 1addi a1, gp, 72beq a2, zero, selected_slant_aaddi a1, gp, 96selected_slant_a:
Next the actual blitting happens. The image has to be blit row-by-row and each time a row ends, we want to crank the typewriter to the next place to draw. This part calculates the place where move after the line is ready.
# OK, everything ready for a draw.# gp has the framebuffer# a0 has the offset# a1 has the imageld t0, 16(gp) # screen buffer strideld a7, 8(a1) # image widthsub t0, t0, a7sub t0, t0, a7sub t0, t0, a7
Then it has to calculate where the drawing starts from.
# Not doing overdraw protection so this is fairly simple.ld a2, 40(gp) # framebuffer addressadd a2, a2, a0ld a3, 0(a1) # image address
And finally setup a loop for drawing. The bge -instructions here prevent it going into loop if nothing needs to be done.
li a4, 0ld a5, 16(a1) # image heightli a6, 0ld a7, 8(a1) # image widthbge a6, a7, blit_donebge a4, a5, blit_donepixel_blit:
Too late I noticed that the images were in a BGR format and upside down.
# Pixel blitlb a0, 0(a3)sb a0, 2(a2)lb a0, 1(a3)sb a0, 1(a2)lb a0, 2(a3)sb a0, 0(a2)addi a3, a3, 3addi a2, a2, 3# Loop to the next pixeladdi a6, a6, 1blt a6, a7, pixel_blit# Preparing for the next rowli a6, 0add a2, a2, t0# Loop to the next rowaddi a4, a4, 1blt a4, a5, pixel_blitblit_done:
Finally everything just needs to be repeated several hundred times and then the program has to stop.
# Loop to the next column/rowaddi s3, s3, 1blt s3, s1, draw_a_columnaddi s2, s2, 1blt s2, s0, draw_a_row# The drawing has been finished.li a0, 0li a7, sys_exitecallprogram_failure:li a0, 1li a7, sys_exitecallloop: j loop
All the data the program is using is stored in the end. Here's the part that includes the images into the binary and ensures you don't need to load them in as files.
window_file_path:.string "window".byte 0# 16x8x3 images, 54 byte headerslant_a_bmp: .incbin "slant_a.bmp"slant_b_bmp: .incbin "slant_b.bmp"filesize = . - ehdr

The riscv-maze is also available in github and you can do whatever you want with it.
Writing your own assembler in Python
When you've had enough from the assembly, there's a natural next step you can take. You start writing your own macroassemblers or compilers, of course.
Python sucks in a lot of ways and is gradually getting worse. But although it has peaked, it really doesn't have a proper contender to surpass it yet.
Anyway, here's the mini-hello program written in our diy assembler:
from riscv_asm import *load_address = 0x010000_start = Group()file_end = Group()ehdr = Group()phdr = Group()ehdr.body.extend([bytes(0x7f, 0x45, 0x4c, 0x46, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),halfs(2, 0xf3), # e_type, e_machinewords(1), # e_versionquads(_start), # e_entryquads(Op(operator.sub, phdr, ehdr)), # e_phoffquads(0), # e_shoffwords(0), # e_flagshalfs(64, 56), # e_ehsize, e_phentsizehalfs(1, 0, 0, 0), # e_phnumcheck_size(64)]) # e_shentsize (64)# e_shnum# e_shstrndxfile_size = Op(operator.sub, file_end, load_address)phdr.body.extend([words(1, 5), # p_type, p_flagsquads(0), # p_offsetquads(load_address, load_address), # p_vaddr, p_paddrquads(file_size, file_size), # p_fileszquads(0x1000), # p_aligncheck_size(56)])# Linux system call numbers https://rv8.io/syscalls.htmlsys_write = 64sys_exit = 93# Standard fileno numbersstdin = 0stdout = 1stderr = 2greeting_text_str = "Hello world\n"greeting_text_len = len(greeting_text_str)greeting_text = Group([string(greeting_text_str)])success = Group([addi(10, 0, 0),addi(17, 0, sys_exit),ecall ])failure = Group([addi(10, 0, 1),addi(17, 0, sys_exit),ecall ])program = Group([ehdr,phdr,_start,addi(10, 0, stdout),lli(11, greeting_text),addi(12, 0, greeting_text_len),addi(17, 0, sys_write),ecall,bne(10, 12, failure),success,failure,greeting_text,file_end])assemble(load_address, program, "hello")
We are using our assembler to create the ELF files as well. It provides all the functionality we need to do that, so I suppose that's okay.
If you're going to use C code along your assembly, I suggest you also write ELF headers for generating object files so that you can feed the output to a linker.
Let's examine at the riscv_asm.py a bit.
"""riscv_asm~~~~~~~~~A small riscv assembler"""from functools import partialimport operatorimport os
We're going to implement a small language for generating binaries.
It should be unsurprising to see us lift some functions into that language.
We're also going to do a bit of
currying,
as in partially filling functions with parameters.
Finally we're using chmod from the os module to change file permissions
to executable, after we got a binary.
The assemble routine keeps rewriting the program until labels no longer
change their place in the program.
The program is also generating a list of errors.
These are mainly values that do not fit into the structures.
We're only looking at them after the labels have stopped shifting their place.
def assemble(org, program, out_file="a.out"):while True:block = []errors = []if program.encode(org, block, None, errors):breakif len(errors) > 0:for fmt, info in errors:print fmt.format(*info)else:with open(out_file, "wb") as fd:fd.write("".join(block))os.chmod(out_file, 0755)
Groups are blocks, labels or sections in the program. Theoretically we could replace the "self.org" with a "self.offset", but there are potentially other problems to fix before that.
Encoders are wrapped functions that pass in additional values after the state.
class Group(object):def __init__(self, body=None):self.body = [] if body is None else bodyself.org = 0def encode(self, org, block, group, errors):ready = (self.org == org + len(block))self.org = org + len(block)for encoder in self.body:subready = encoder.encode(org, block, self, errors)ready = ready and subreadyreturn readyclass Encoder(object):def __init__(self, fn, values):self.fn = fnself.values = valuesdef encode(self, org, block, group, errors):return self.fn(org, block, group, errors, self.values)
The generative parts of the assembler are encoded into the above structures.
We also have plain computational parts for addresses.
They are encoded using the Op -construct and resolved using the
resolve(org, value) -command.
The here is a constant that gives the current origin to the encoder.
class Op(object):def __init__(self, fn, *args):self.fn = fnself.args = argsdef resolve(org, arg):if isinstance(arg, Group):return arg.orgelif isinstance(arg, Op):return arg.fn(*(resolve(org, arg) for arg in arg.args))elif arg is here:return orgelse:return arghere = object()
The simple encoders take care of inserting data into the code.
def bytes(*args):return Encoder(enc_bytes, args)def halfs(*args):return Encoder(enc_halfs, args)def words(*args):return Encoder(enc_words, args)def quads(*args):return Encoder(enc_quads, args)def string(arg):return Encoder(enc_string, arg)
Then we have a tool that lets us check the size of a structure. So that we see it matches with what we expect.
def check_size(size):return Encoder(enc_check_size, size)
This check routine creates entries into the error list if the condition doesn't satisfy.
def check(errors, condition, fmt, *info):if not condition:errors.append((fmt, info))
Then come the actual generators for the above parts.
def enc_bytes(org, block, group, errors, args):for a in args:a = resolve(org + len(block), a)check(errors, 0 <= a <= 0xFF, "bytes {}", hex(a))block.append(chr(a))return Truedef enc_halfs(org, block, group, errors, args):for a in args:a = resolve(org + len(block), a)check(errors, 0 <= a <= 0xFFFF, "halfs {}", hex(a))block.append(chr((a >> 0) & 255))block.append(chr((a >> 8) & 255))return Truedef enc_words(org, block, group, errors, args):for a in args:a = resolve(org + len(block), a)check(errors, 0 <= a <= 0xFFFFFFFF, "words {}", hex(a))block.append(chr((a >> 0) & 255))block.append(chr((a >> 8) & 255))block.append(chr((a >> 16) & 255))block.append(chr((a >> 24) & 255))return Truedef enc_quads(org, block, group, errors, args):for a in args:a = resolve(org + len(block), a)check(errors, 0 <= a <= 0xFFFFFFFFFFFFFFFF, "quads {}", hex(a))block.append(chr((a >> 0) & 255))block.append(chr((a >> 8) & 255))block.append(chr((a >> 16) & 255))block.append(chr((a >> 24) & 255))block.append(chr((a >> 32) & 255))block.append(chr((a >> 40) & 255))block.append(chr((a >> 48) & 255))block.append(chr((a >> 56) & 255))return Truedef enc_string(org, block, group, errors, string):block.extend(string.encode('utf-8'))return Truedef enc_check_size(org, block, group, errors, size):real_sz = org + len(block) - resolve(org + len(block), group)sz = resolve(org + len(block), size)check(errors, real_sz == sz, "size mismatch {} != {}", real_sz, sz)return True
For your own good, check the input ranges and treat signed/unsigned quantities as different types.
Next we got a fairly simple tool that we're using a little in building the thing, but not actually otherwise.
# Can be used to gauge if you got it right.def str_template(template):vector = bin(template)[2:34].rjust(32, '0')return " ".join([vector[0:7], vector[7:12], vector[12:17],vector[17:20], vector[20:25], vector[25:]])return vector
To show you what this is doing.
>>> str_template(0)'0000000 00000 00000 000 00000 0000000'
This is the "frame" that 32-bit long RISC-V instructions use.
We can use it to check whether we've provided the right encoding.
For example, the BGEU should be 0x7063 and it's using the B-type filling.
'0000000 00000 00000 111 00000 1100011'The instruction frames themselves follow the form. For example, here's the J-type filling.
'_______ _____ _____ ___ XXXXX _______' rd register'X______ _____ _____ ___ _____ _______' imm bit 20'_XXXXXX XXXX_ _____ ___ _____ _______' imm bits 10:1'_______ ____X _____ ___ _____ _______' imm bits 11'_______ _____ XXXXX XXX _____ _______' imm bits 19:12
What are the shifting offsets and masks?
25 20 15 12 7 0'_______ _____ _____ ___ _____ XXXXXXX' 0x0000007F'_______ _____ _____ ___ XXXXX _______' 0x00000F80'_______ _____ _____ XXX _____ _______' 0x00007000'_______ _____ XXXXX ___ _____ _______' 0x000F8000'_______ XXXXX _____ ___ _____ _______' 0x01F00000'XXXXXXX _____ _____ ___ _____ _______' 0xFE000000
So we can use this to construct the correct encoding for J-type filling and see that it holds:
>>> imm = 0x1FFFFF>>> imm_20 = (imm << 11) & 0x80000000>>> imm_10_1 = (imm << 20) & 0x7FE00000>>> imm_11 = (imm << 9) & 0x00100000>>> imm_19_12 = imm & 0x000FF000>>> str_template(imm_20)'1000000 00000 00000 000 00000 0000000'>>> str_template(imm_10_1)'0111111 11110 00000 000 00000 0000000'>>> str_template(imm_11)'0000000 00001 00000 000 00000 0000000'>>> str_template(imm_19_12)'0000000 00000 11111 111 00000 0000000'
Just check that the sign bit gets on the right place.
Here's the program that encodes the RISC-V instructions. Don't trust the encodings. I may have made a mistake.
class I32Enc(object):def __init__(self, template, fn, *values):self.template = templateself.fn = fnself.values = valuesdef encode(self, org, block, group, errors):a = self.template | self.fn(org, block, group, errors, self.values)block.append(chr((a >> 0) & 255))block.append(chr((a >> 8) & 255))block.append(chr((a >> 16) & 255))block.append(chr((a >> 24) & 255))return Truedef r_type(org, block, group, errors, (rd, rs1, rs2)):assert 0 <= rd < 32assert 0 <= rs1 < 32assert 0 <= rs2 < 32return (rd << 7) | (rs1 << 15) | (rs2 << 20)def fence_type(org, block, group, (prec, succ)):imm = (prec & 0xF) << 4 | (succ & 0xF)return i_type(org, block, group, (0, 0, imm))def i_type(org, block, group, errors, (rd, rs1, imm)):assert 0 <= rd < 32assert 0 <= rs1 < 32imm = resolve(org + len(block), imm)check(errors, -2048 <= imm < 2048, "imm field (i_type) {}", imm)imm_11_0 = (imm << 20) & 0xFFF00000return imm_11_0 | (rd << 7) | (rs1 << 15)def s_type(org, block, group, errors, (rs1, rs2, imm)):assert 0 <= rs1 < 32assert 0 <= rs2 < 32imm = resolve(org + len(block), imm)check(errors, -2048 <= imm < 2048, "imm field (s_type) {}", imm)imm_11_5 = (imm << 20) & 0x7FF00000imm_4_0 = (imm << 7) & 0x00000F80return imm_11_5 | imm_4_0 | (rs1 << 15) | (rs2 << 20)def b_type(org, block, group, errors, (rs1, rs2, imm)):assert 0 <= rs1 < 32assert 0 <= rs2 < 32imm = resolve(org + len(block), imm) - (org + len(block))check(errors, -4096 <= imm < 4096, "imm field (b_type) {}", imm)assert imm & 1 == 0imm_12 = (imm << 19) & 0x80000000imm_10_5 = (imm << 15) & 0x7F000000imm_4_1 = (imm << 7) & 0x00000F00imm_11 = (imm >> 4) & 0x00000080return imm_12 | imm_10_5 | imm_4_1 | imm_11 | (rs1 << 15) | (rs2 << 20)def u_type(org, block, group, errors, (rd, imm)):assert 0 <= rd < 32imm = resolve(org + len(block), imm)assert imm & 0xFFF == 0return imm | (rd << 7)def j_type(org, block, group, errors, (rd, imm)):assert 0 <= rd < 32imm = resolve(org + len(block), imm) - (org + len(block))check(errors, -1048576 <= imm < 1048576, "imm field (j_type) {}", imm)assert imm & 1 == 0imm_20 = (imm << 11) & 0x80000000imm_10_1 = (imm << 20) & 0x7FE00000imm_11 = (imm << 9) & 0x00100000imm_19_12 = imm & 0x000FF000return imm_20 | imm_10_1 | imm_11 | imm_19_12 | (rd << 7)
Finally we provide the hi/lo -macros for splitting large values into lui/auipc and addi -combinations.
def calc_hi(value):if value > 0 and value & 2048 != 0:value = value + 4096if value < 0 and value & 2048 != 0:value = value + 4096return value ^ (value & 4095)def calc_lo(value):k = value & 4095if k >= 2048:return ~(~k & 4095)return khi = partial(Op, calc_hi)lo = partial(Op, calc_lo)# Self-check for hi/lo calculators.for x in range(-10000, 10000):assert calc_hi(x) + calc_lo(x) == xdef lli(rd, imm):rel = Group()offset = Op(operator.sub, imm, rel)rel.body.extend([auipc(rd, hi(offset)),addi(rd, rd, lo(offset)) ])return rel
Of course we do nothing with these without some encoding tables. Don't trust these either!
lui = partial(I32Enc, 0x37, u_type)auipc = partial(I32Enc, 0x17, u_type)jal = partial(I32Enc, 0x6F, j_type)jalr = partial(I32Enc, 0x67, i_type)beq = partial(I32Enc, 0x0063, b_type)bne = partial(I32Enc, 0x1063, b_type)blt = partial(I32Enc, 0x4063, b_type)bge = partial(I32Enc, 0x5063, b_type)bltu = partial(I32Enc, 0x6063, b_type)bgeu = partial(I32Enc, 0x7063, b_type)lb = partial(I32Enc, 0x0003, i_type)lh = partial(I32Enc, 0x1003, i_type)lw = partial(I32Enc, 0x2003, i_type)lbu = partial(I32Enc, 0x4003, i_type)lhu = partial(I32Enc, 0x5003, i_type)sb = partial(I32Enc, 0x0023, s_type)sh = partial(I32Enc, 0x1023, s_type)sw = partial(I32Enc, 0x2023, s_type)addi = partial(I32Enc, 0x0013, i_type)slti = partial(I32Enc, 0x2013, i_type)sltiu = partial(I32Enc, 0x3013, i_type)xori = partial(I32Enc, 0x4013, i_type)ori = partial(I32Enc, 0x6013, i_type)andi = partial(I32Enc, 0x7013, i_type)slli = partial(I32Enc, 0x00001013, r_type)srli = partial(I32Enc, 0x00005013, r_type)srai = partial(I32Enc, 0x40005013, r_type)add = partial(I32Enc, 0x00000033, r_type)sub = partial(I32Enc, 0x40000033, r_type)sll = partial(I32Enc, 0x00001033, r_type)slt = partial(I32Enc, 0x00002033, r_type)sltu = partial(I32Enc, 0x00003033, r_type)xor = partial(I32Enc, 0x00004033, r_type)srl = partial(I32Enc, 0x00005033, r_type)sra = partial(I32Enc, 0x40005033, r_type)or_ = partial(I32Enc, 0x00006033, r_type)and_ = partial(I32Enc, 0x00007033, r_type)fence = partial(I32Enc, 0x000F, fence_type)fence_i = words(0x100F)ecall = words(0x000073)ebreak = words(0x100073)csrrw = partial(I32Enc, 0x1073, i_type)csrrs = partial(I32Enc, 0x2073, i_type)csrrc = partial(I32Enc, 0x3073, i_type)csrrwi = partial(I32Enc, 0x5073, i_type)csrrsi = partial(I32Enc, 0x6073, i_type)csrrci = partial(I32Enc, 0x7073, i_type)
That's the whole thing for now. In practice you probably want to distinguish between register and constant arguments. In some instructions we should give them additional annotations because semantics. For example, in the SLLI, SRLI, SRAI instructions rs2 is shift-amount. The position in the argument list doesn't change, but the type is entirely different.
Here's still a little treat you may appreciate if you try to examine the bit fiddling in this program.
If you don't happen to remember how to convert between binary and hexadecimals. It should be easy by dividing the problem into quadrants:
00xx 01xx 10xx 11xxxx00 0 4 8 Cxx01 1 5 9 Dxx10 2 6 A Exx11 3 7 B F
Trace left to obtain the lower two digits and up to obtain the higher two digits for the hexadecimal digit.
So if you were to look at what binary a hexadecimal refers to, go "zero, four, eight, C" to first increment left side, then increment by one until you get to the hexadecimal digit you have.
Likewise, you can examine two first digits of a 4-group binary and then the second two digits to table the hexadecimal digit for yourself.
To negate a number, you flip the bits around and increment it by one.
Eg. ~x + 1.
The riscv-diy-asmcc contains the assembler presented here. If you want the exact source code for the code presented here, it's under the tag 2019_feb_25_blogpost.
Writing your own compiler in Python
There's a paper "Near-Optimal Instruction Selection on DAGs" that I've read and implemented at least once.
Linear register allocation or Chaitin-Briggs algorithm is probably needed as well.
There's also still the neat paper by Jorge Navas, "Horn-Clauses as an Intermediate Representation for Program Analysis". that I found interesting as a compiler writer.
The blog post is starting to be 7000 words long by now. I guess these are subjects for the another time.
Machine-readable specification
I absolutely love machine-readable specifications. Compilers are the first places where there should be lots and lots of them. There shouldn't really exist any single system software you can't or shouldn't be able to reimplement yourself with reasonable time and knowledge.
Projects such as implementing your own compiler and toolchain doesn't need to be just hobby projects, thanks to specifications that can be read by a machine. I did few useful tools to cater my previous programming language that anyone possibly wanting to write this kind of stuff might like to know.
The vkstruct provide some python scripts and macros for examining Vulkan graphics API's XML-formatted machine specs. It produces a json file that I used to implement primitive wrappers in Lever for Vulkan spec.
The cffi-gen is a set of tools to parse and examine C headers. Idea is that you write a Python script that implements few rewriting rules to produce a clean json-formatted header file that you can use to drive an FFI.
These haven't been very popular projects because it turns out lots of people love doing repetitive tasks and something like knitting but with a computer. Everybody loves knitting socks but few love making electronic knitting machines that make socks. Too bad those handmade SDL2 bindings do not warm your feet, yet they stink and are buggy!
Hardware to buy
When you finally get something fun done, you'd probably want hardware for it. There's sifive's hifive arduino board, the chinese sipeed/kendryte. NXP is making boards.
I guess I'll buy some board at some point, would prefer one with 5V GPIO and ADC ports. Doesn't harm if it has video output built in. If you wonder what I'm going to do, well.. It suffices to say I've been watching 8bitguy. He's got some real influence on people. Also been Arduino programming variety of things.
Who would like to follow this tutorial?
It's possible you just read it through for fun. But is your own compiler/assembler writing possibly something more than just a fun exercises? I don't know.
This opens up possibilities and ways to do things that have been closed off from most people for three decades now. Here I just exclaim you that "Hey! You can do this again!".
If you ask me, I didn't want to end up having to use common tools because I cannot make my own. I think with GCC and LLVM it's been steering to that direction for a while. I want to be in an situation that I use common tools because they're so good that there's no need to make better myself.