Mill CPU Architecture Overview

I've studied the Mill CPU architecture by Mill Computing, Inc. for the last week. I watched their video lectures, read their forums and asked a few questions there. Based on what I've seen so far, I'm excited about where this is going.

The Mill looks like it would be a really simple architecture to work with. On Intel or ARM hardware there are exceptions upon exceptions, with system programmer manuals that are thousands of pages long. I feel it would be difficult to fill even 100 pages describing system programming on the Mill. The usual kludges aren't there; the Mill dances around them with clever design.

Why does it matter what kind of architecture you are running on? I can point out a few reasons. Porting platform-specific things is easier the fewer obstacles there are. Troubleshooting complex systems is easier the more predictable and reliable the system is. Abstractions introduced while programming always tend to leak details of the implementation, so limitations and kludges at the hardware level will reflect into higher layers over time.

It is probably interesting to know why I think it's going to be one of the easiest architectures to program, if you ever get to do it. I do not know in advance who will buy these things. Mill Computing, Inc. has also said everywhere that they are at least a year away from a marketable product, so anything could happen in between. It's an interesting architecture and valuable to review in its own right, though.

Program loading

Mill processors are meant to run existing programs without a rewrite. Compilers, operating system kernels and the like will require a port, though. That port should be easy!

Whenever a compiler or JIT targets the Mill, the output is a graph of operators from the abstract Mill operation set. The architecture specifies a separate loader format: a serialized forest of data- and control-flow graphs. When the program loads, those graphs are fed to a specializer service, which translates them into instruction blocks. The translation runs in nearly linear time relative to the graph size.
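The idea can be sketched in a few lines. This is a toy model only, assuming a graph already in topological order; the operation names and opcodes are invented for illustration and have nothing to do with the real loader format:

```python
# Toy model of the specializer idea: a graph of abstract Mill operations
# is lowered to a family member's concrete encoding in one linear pass.
# All names and opcode values here are invented for illustration.

def specialize(graph, encoding):
    """graph: list of (op, operands) pairs in topological order;
    encoding: maps abstract op names to member-specific opcodes."""
    code = []
    for op, operands in graph:          # one pass over the graph -> linear time
        code.append((encoding[op], operands))
    return code

# An abstract graph for "load a, load b, add them":
graph = [("load", ["a"]), ("load", ["b"]), ("add", [0, 1])]

# Two hypothetical family members with different concrete encodings:
gold = {"load": 0x10, "add": 0x20}
tin  = {"load": 0x01, "add": 0x02}
```

The same abstract graph specializes to different bit patterns on `gold` and `tin`, which is why one distributed program can serve the whole family.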

Every CPU in the Mill processor family varies in its instruction set encoding. The encoding scheme is generic, but the bits mean different things on different family members. This organization lets the platform extend near-indefinitely while the programming model stays simple.

The belt model

The Mill is based on a new machine model. In this model there are no program-accessible registers. Instead there's a belt. Every operation drops a value onto the front of the belt, and old values fall off the far end. Every operation can access any value on the belt. Some values live long; those get spilled into a scratchpad and retrieved whenever they're needed.
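The belt behaves like a fixed-length queue with temporal addressing: operands are named by how recently they dropped, not by a register number. Here is a minimal sketch, assuming an arbitrary belt length of 8 (real family members differ):

```python
from collections import deque

# Minimal sketch of the belt: a fixed-length FIFO of values. Every result
# is dropped at the front; the oldest value falls off the far end.
# Position 0 is always the newest value (temporal addressing).
# The belt length of 8 is an arbitrary choice for this sketch.

class Belt:
    def __init__(self, length=8):
        self.slots = deque(maxlen=length)   # oldest values fall off automatically

    def drop(self, value):
        """Every operation drops its result onto the front of the belt."""
        self.slots.appendleft(value)

    def get(self, pos):
        """Read an operand by belt position; 0 is the newest."""
        return self.slots[pos]

belt = Belt()
belt.drop(3)                             # some earlier result
belt.drop(4)                             # a newer result
belt.drop(belt.get(0) + belt.get(1))     # "add b0, b1" drops 7
```

Note that nothing is ever overwritten in place: an "add" does not clobber an operand, it just drops a new value, and the positions of all older values shift by one.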

The model reminds me of linear scan register allocators, which were popular in JIT compilers because they complete faster than algorithms based on graph coloring. Programs create lots of values that live a cycle or two and then expire; the Mill takes that to heart in its hardware design. The designers have also come to an interesting conclusion about global state: there isn't any. The lack of explicit global state at the hardware level makes the whole system more predictable, more reliable and easier to understand. Coincidentally, the effect is the same as at the source code level.

Metadata

Global state in the processor has been eliminated; there are no global condition flags either. Whenever a value is dropped onto the belt, metadata comes along with it: the bit width, vector shape, and flags related to the value. Whenever an operation computes something with the value, it drops a new value onto the belt with updated metadata. The metadata is used for speculative execution, loop pipelining, conditional branching and local error recovery.
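In other words, what would be global flag bits on a conventional machine travels with each value instead. A simplified sketch of the idea, with invented field names and a guessed width-propagation rule:

```python
# Sketch of per-value metadata: each belt value carries its own tags, and
# an operation derives the result's metadata from its operands' metadata.
# The fields and the propagation rule here are simplified guesses.

class Value:
    def __init__(self, bits, width):
        self.bits = bits      # the payload
        self.width = width    # metadata: operand width in bytes

def add(a, b):
    # The result carries fresh metadata; here it inherits the wider
    # operand's width tag instead of setting any global flag.
    return Value(a.bits + b.bits, max(a.width, b.width))

x = add(Value(1, width=4), Value(2, width=8))
```

Because the tags ride along with the value, two threads computing in parallel can never trample each other's "flags" — there is simply no shared flag register to trample.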

Speculative execution

Even on general-purpose processors, branching is expensive. You'll want to eliminate a branch if there's an equivalent expression that does the same thing in less time. The Mill is a MIMD machine: its instructions are very wide, containing many operations that access lots of data. A lot of the time you would rather compute both values early and then just pick one instead of branching. The Mill has a special mechanism that makes such speculative code equivalent to the conditional version, so they're interchangeable. The trick is to mark a bad value as NaR (Not a Result), with the source address annotated in, whenever one is generated. If a NaR reaches a non-speculable operation, the system faults. There's also a concept of None, which turns the consuming operations into no-ops, even within vector operations.
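The key property is that NaR and None flow harmlessly through speculable operations and only matter when they reach something with a side effect. A simplified sketch, with the store standing in for any non-speculable operation:

```python
# Sketch of NaR and None propagation. A NaR ("Not a Result") passes
# through speculable operations and only faults when it reaches a
# non-speculable one, such as a store; a None turns the consuming
# operation into a no-op. Details are heavily simplified.

NAR, NONE = object(), object()   # unique sentinel values for this sketch

def spec_add(a, b):
    # Speculable: NaR and None propagate instead of faulting.
    if NAR in (a, b):
        return NAR
    if NONE in (a, b):
        return NONE
    return a + b

def store(mem, addr, v):
    # Non-speculable: a NaR arriving here faults; a None is a no-op.
    if v is NAR:
        raise RuntimeError("fault: NaR reached a non-speculable operation")
    if v is NONE:
        return
    mem[addr] = v

mem = {}
store(mem, 0, spec_add(1, 2))       # normal path: writes 3
store(mem, 1, spec_add(NONE, 2))    # None: the store becomes a no-op
```

This is what makes hoisting both arms of a branch safe: a load from the not-taken arm may produce a NaR, and as long as the result is discarded by the pick, nothing ever faults.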

Hardware supported function calls

Every function has its own belt and scratchpad, maintained by the hardware. Function calls are handled by the hardware as well! The call operation takes the arguments from the caller's belt, delivers them to the callee's belt, and returns the callee's results onto the caller's belt. The same unit that spills and fills between the belt and the scratchpad also takes care of return addresses. The scratchpad is protected from the program itself. This eliminates stack-smashing exploits, the oldest and most reliable method attackers have for gaining access to a remote computer. Hardware management of the stacks also prevents malicious programs from inspecting rubble left on the stack: whenever the stack is expanded, the new space is cleared to zero by the hardware.
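The calling convention can be sketched as follows. This is only an illustration of the visibility rules described above, not the real mechanism; the helper names are invented:

```python
# Sketch of the Mill call model: the callee starts with a fresh belt
# seeded with just its arguments, and its results drop back onto the
# caller's belt. The caller's values are invisible to the callee.
# Function and parameter names are invented for this illustration.

def hw_call(caller_belt, func, arg_positions):
    args = [caller_belt[p] for p in arg_positions]  # read args off the belt
    callee_belt = list(args)      # fresh belt: only the arguments are visible
    results = func(callee_belt)   # the callee never sees the caller's belt
    for r in results:             # results drop onto the front of the caller's belt
        caller_belt.insert(0, r)
    return caller_belt

def callee(belt):
    return [belt[0] + belt[1]]    # add the two arguments

belt = [10, 20, 99]               # caller's belt; position 0 is newest
hw_call(belt, callee, [0, 1])
```

There is no software prologue or epilogue here, and no return address on any program-visible stack — which is exactly why there is nothing for a stack smash to overwrite.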

I asked how green threads and coroutines are implemented in such a system. I was told that they're supported in hardware as well, but the patent hadn't been filed yet so they couldn't tell me more. Implementing them in a language should then be simple and efficient. If this is true, very complex asynchronous software could be written in a very simple manner for the Mill, even under high performance requirements.

One interesting tidbit is that hardware interrupts on the Mill are just function calls as well. Very predictable and logical.

Single virtual address space

The Mill is a shared-memory system with a single shared virtual address space. If there's a value at address f00d, it's there, and switching threads doesn't change that fact. The benefit for programmability is that the memory hierarchy is extremely easy to understand from the user's perspective. How the Mill manages memory is not going to be a mystery to anyone curious enough to figure it out. It can be explained in a few sentences:

Whenever a load needs to go all the way to DRAM, it first passes through some layers of caches and the protection lookaside buffer (PLB). The PLB checks whether the load deserves a NaR. If the load goes deeper, it passes through the translation lookaside buffer (TLB), which translates virtual addresses into the physical addresses that go to the device controllers. The first cache level is a Harvard architecture, meaning there are separate caches for instructions and data. Your program only ever sees virtual addresses in a single 60-bit address space. That means all your running Mill programs and services must fit into an exabyte.
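The load path above can be sketched as a few dictionary lookups. This is a deliberately flattened model — caches collapsed to one level, the PLB and TLB reduced to tables — but it preserves the ordering the text describes: protection is checked on the virtual address, and translation happens only on the way out to DRAM:

```python
# Sketch of the load path: the PLB decides whether the access is allowed
# (yielding a NaR if not), the virtual-addressed cache is checked, and
# only a miss is translated by the TLB and sent on to DRAM.
# All structures are flattened to dicts for illustration.

NAR = object()   # sentinel standing in for a Not-a-Result value

def load(addr, cache, plb, tlb, dram):
    if not plb.get(addr, False):   # protection check on the virtual address
        return NAR                 # a denied load yields a NaR, not data
    if addr in cache:              # caches work on virtual addresses
        return cache[addr]
    phys = tlb[addr]               # translate only on the way to DRAM
    value = dram[phys]
    cache[addr] = value            # fill the cache for next time
    return value

cache, plb = {}, {0xf00d: True}
tlb, dram = {0xf00d: 0x42}, {0x42: 7}
```

Notice that a second load of the same address never consults the TLB at all — translation cost is paid only at the DRAM boundary.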

Turf & Region security model

The Mill security model is based on the concept of a region. Every region is described by a region descriptor, which tells which turf or thread is allowed to access it and what rights are granted. Regions are byte-granular, so access to small areas can be granted to other processes. Newly granted regions pass through the PLB, as most of them are expected to be short-lived.

A turf is the Mill's equivalent of a process, but it simply consists of the regions that belong to it. Turfs and threads vary independently, but both have some well-known region descriptors stored in registers to avoid excessive trips to the PLB.
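A region check then amounts to scanning descriptors for one that covers the access. The field names below are invented for illustration; the real descriptor format is not public in this detail:

```python
from dataclasses import dataclass

# Sketch of a region check: each descriptor names a byte-granular address
# range, the turf it applies to, and the rights granted. An access is
# allowed only if some descriptor covers it. Field names are invented.

@dataclass
class Region:
    lo: int      # inclusive lower bound, byte-granular
    hi: int      # exclusive upper bound
    turf: str    # which turf this grant applies to
    rights: str  # e.g. "r", "rw", "rx"

def allowed(regions, turf, addr, right):
    return any(r.lo <= addr < r.hi and r.turf == turf and right in r.rights
               for r in regions)

# A 5-byte read-write grant to the hypothetical turf "app":
regions = [Region(0x1000, 0x1005, "app", "rw")]
```

Byte granularity is what makes the model practical for sharing: a service can expose exactly one buffer, or even part of one, rather than a whole page.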

The security philosophy of the Mill is that every part of the system is equal, and an application can only see what others pass to it. There is no third party in between enforcing this, other than the hardware.

Portals

It is possible to grant access to a function. That is, function calls can cross turfs under strict control. Before a portal call, the program can grant transient access to the data it wants to expose across the call. Such calls cost the same as normal function calls plus two cache fetches. During the call, callee and caller are protected from each other. Again, a mechanism that is very easy to predict and troubleshoot.

Microkernel-ready

Because cross-turf calls are cheap, the Mill favours a microkernel. Instead of a single kernel mode, different functions and drivers can be isolated into their own turfs. Such a system would be, if not more secure, then at least more reliable than a monolithic kernel. It might be easier to develop drivers too, since you know they can hardly break the rest of the system even if you wanted them to.

Power-up

There is no supervisor mode. In fact, the operating system has no additional privileges at all; it is merely a turf and a thread like the rest of the system. Instead, the system starts in a special turf, called All, which grants access to the whole virtual address space.

Finally

In short, I haven't seen anything like the Mill before. I learned a lot more than this, of course, but those other things mostly concerned how it reaches its unusual performance. They claim it's 10 times more performance/power efficient than today's common out-of-order superscalar processors.

The processor seems like it would greatly improve the performance of higher-level programming languages such as JavaScript or Lisp. I suspect we'll see similar features introduced elsewhere later on. If that's true, it might be an ideal architecture for mobile phones as well as for desktops, which have become filled with web applications of all kinds.