SBCL: The Assembly Code Breadboard

yacin1 pts0 comments

SBCL: the ultimate assembly code breadboard - Paul Khuong: some Lisp

EDIT: Lutz Euler points out that the NEXT sequence (used to) encode<br>an effective address with an index register but no base. The mistake<br>doesn’t affect the meaning of the instruction, but forces a wasteful<br>encoding. The difference in machine code are as follows.

Before (14 bytes):

; 03: 8B043D00000000 MOV EAX, [RDI] ; _5_ useless bytes!<br>; 0A: 4883C704 ADD RDI, 4<br>; 0E: 4801F0 ADD RAX, RSI<br>; 11: FFE0 JMP RAX

Now (9 bytes):

; 93: 8B07 MOV EAX, [RDI]<br>; 95: 4883C704 ADD RDI, 4<br>; 99: 4801F0 ADD RAX, RSI<br>; 9C: FFE0 JMP RAX

I fixed the definition of NEXT, but not the disassembly snippets<br>below; they still show the old machine code.

Earlier this week, I took another look at the<br>F18. As usual with Chuck Moore’s<br>work, it’s hard to tell the difference between insanity and mere<br>brilliance ;) One thing that struck me is how small the stack is: 10<br>slots, with no fancy overflow/underflow trap. The rationale is that,<br>if you need more slots, you’re doing it wrong, and that silent<br>overflow is useful when you know what you’re doing. That certainly<br>jibes with my experience on the HP-41C and with x87. It also reminds<br>me of a<br>post of djb’s decrying our misuse of x87’s rotating stack:<br>his thesis was that, with careful scheduling, a “free” FXCH makes<br>the stack equivalent – if not superior – to registers. The post<br>ends with a (non-pipelined) loop that wastes no cycle on shuffling<br>data, thanks to the x87’s implicit stack rotation.

That lead me to wonder what implementation techniques become available<br>for stack-based VMs that restrict their stack to, e.g., 8 slots.<br>Obviously, it would be ideal to keep everything in registers.<br>However, if we do that naïvely, push and pop become a lot more<br>complicated; there’s a reason why Forth engines usually cache only the<br>top 1-2 elements of the stack.

I decided to mimic the x87 and the F18 (EDIT: modulo the latter’s two<br>TOS cache registers): pushing/popping doesn’t cause any data movement.<br>Instead, like the drawing below shows, they decrement/increment a modular<br>counter that points to the top of the stack (TOS). That would still be<br>slow in software (most ISAs can’t index registers). The key is that<br>the counter can’t take too many values: only 8 values if there are 8<br>slots in the stack. Stack VMs already duplicate primops for performance<br>reasons (e.g., to help the BTB by spreading out execution of the same<br>primitive between multiple addresses), so it seems reasonable to specialise<br>primitives for all 8 values the stack counter can take.

In a regular direct threaded VM, most primops would end with a code<br>sequence that jumps to the next one (NEXT), something like<br>add rsi, 8 ; increment virtual IP before jumping<br>jmp [rsi-8] ; jump to the address RSI previously pointed to<br>where rsi is the virtual instruction pointer, and VM instructions<br>are simply pointers to the machine code for the relevant primitive.

I’ll make two changes to this sequence. I don’t like hardcoding<br>addresses in bytecode, and 64 bits per virtual instruction is overly<br>wasteful. Instead, I’ll encode offsets from the primop code block:<br>mov eax, [rsi]<br>add rsi, 4<br>add rax, rdi<br>jmp rax<br>where rdi is the base address for primops.

I also need to dispatch based on the new value of the implicit stack<br>counter. I decided to make the dispatch as easy as possible by<br>storing the variants of each primop at regular intervals (e.g. one<br>page). I rounded that up to 64 * 67 = 4288 bytes to minimise<br>aliasing accidents. NEXT becomes something like<br>mov eax, [rsi]<br>add rsi, 4<br>lea rax, [rax + rdi + variant_offset]<br>jmp rax

The trick is that variant_offset = 4288 * stack_counter, and the<br>stack counter is (usually) known when the primitive is compiled. If<br>the stack is left as is, so is the counter; pushing a value decrements<br>the counter and popping one increments it.

That seems reasonable enough. Let’s see if we can make it work.

Preliminaries

I want to explore a problem for which I’ll emit a lot of repetitive<br>machine code. SLIME’s REPL and SBCL’s assembler are perfect for the<br>task! (I hope it’s clear that I’m using unsupported internals; if it<br>breaks, you keep the pieces.)

The basic design of the VM is:

r8-r15: stack slots (32 bits);

rsi: base address for machine code primitives;

rdi: virtual instruction pointer (points to the next instruction);

rax,rbx,rcx,rdx: scratch registers;

rsp: (virtual) return stack pointer.

10<br>11<br>12<br>13<br>14<br>15<br>16<br>17<br>18<br>19<br>20<br>21<br>22<br>23<br>24<br>25<br>26<br>27<br>28<br>29<br>30<br>31<br>32<br>33<br>34<br>35<br>(import '(sb-assem:inst sb-vm::make-ea)) ; we'll use these two a lot

;; The backing store for our stack<br>(defvar *stack* (make-array 8 :initial-contents (list sb-vm::r8d-tn<br>sb-vm::r9d-tn<br>sb-vm::r10d-tn<br>sb-vm::r11d-tn<br>sb-vm::r12d-tn<br>sb-vm::r13d-tn<br>sb-vm::r14d-tn<br>sb-vm::r15d-tn)))

;; The _primop-generation-time_ stack pointer<br>(defvar *stack-pointer*)

;; (@ 0) returns the (current) register for TOS, (@ 1) returns<br>;; the one just below, etc.<br>(defun @ (i)<br>(aref *stack* (mod (+ i...

stack code counter next instruction machine

Related Articles