SBCL: The Assembly Code Breadboard

SBCL: the ultimate assembly code breadboard - Paul Khuong: some Lisp

EDIT: Lutz Euler points out that the NEXT sequence (used to) encode an effective address with an index register but no base. The mistake doesn’t affect the meaning of the instruction, but forces a wasteful encoding. The differences in machine code are as follows.

Before (14 bytes):

    ; 03: 8B043D00000000   MOV EAX, [RDI]   ; _5_ useless bytes!
    ; 0A: 4883C704         ADD RDI, 4
    ; 0E: 4801F0           ADD RAX, RSI
    ; 11: FFE0             JMP RAX

Now (9 bytes):

    ; 93: 8B07             MOV EAX, [RDI]
    ; 95: 4883C704         ADD RDI, 4
    ; 99: 4801F0           ADD RAX, RSI
    ; 9C: FFE0             JMP RAX

I fixed the definition of NEXT, but not the disassembly snippets below; they still show the old machine code.
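To see where the five extra bytes come from, it helps to pull apart the bytes that follow the 8B opcode. The sketch below is plain Common Lisp and independent of SBCL's assembler; `split-fields` and `extra-bytes-after-modrm` are hypothetical helper names, and the second one only handles the mod = 0 case that appears in the two listings.

```lisp
;; Split a ModRM or SIB byte into its 2-, 3- and 3-bit fields.
;; ModRM: (mod reg r/m); SIB: (scale index base).
(defun split-fields (octet)
  (list (ldb (byte 2 6) octet)
        (ldb (byte 3 3) octet)
        (ldb (byte 3 0) octet)))

;; Bytes that follow a ModRM byte whose mod field is 0 (64-bit mode):
;; an optional SIB byte plus an optional 32-bit displacement.
;; Hypothetical helper; it deliberately rejects any other mod value.
(defun extra-bytes-after-modrm (modrm &optional sib)
  (destructuring-bind (md reg rm) (split-fields modrm)
    (declare (ignore reg))
    (assert (zerop md))
    (cond ((= rm 4)                       ; r/m = 100: a SIB byte follows
           (let ((base (third (split-fields sib))))
             (+ 1 (if (= base 5) 4 0))))  ; base = 101: no base, disp32
          ((= rm 5) 4)                    ; r/m = 101: RIP-relative disp32
          (t 0))))                        ; plain [register]

(split-fields #x04)                  ; => (0 0 4)  old ModRM: SIB follows
(split-fields #x3D)                  ; => (0 7 5)  old SIB: index RDI, no base
(extra-bytes-after-modrm #x04 #x3D)  ; => 5
(extra-bytes-after-modrm #x07)       ; => 0        new ModRM: plain [RDI]
```

The old MOV is opcode + ModRM + SIB + disp32, or 7 bytes; the new one is opcode + ModRM, or 2 bytes, which accounts for the drop from 14 to 9 bytes.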

Earlier this week, I took another look at the F18. As usual with Chuck Moore’s work, it’s hard to tell the difference between insanity and mere brilliance ;) One thing that struck me is how small the stack is: 10 slots, with no fancy overflow/underflow trap. The rationale is that, if you need more slots, you’re doing it wrong, and that silent overflow is useful when you know what you’re doing. That certainly jibes with my experience on the HP-41C and with x87. It also reminds me of a post of djb’s decrying our misuse of x87’s rotating stack: his thesis was that, with careful scheduling, a “free” FXCH makes the stack equivalent – if not superior – to registers. The post ends with a (non-pipelined) loop that wastes no cycle on shuffling data, thanks to the x87’s implicit stack rotation.

That led me to wonder what implementation techniques become available for stack-based VMs that restrict their stack to, e.g., 8 slots. Obviously, it would be ideal to keep everything in registers. However, if we do that naïvely, push and pop become a lot more complicated; there’s a reason why Forth engines usually cache only the top 1-2 elements of the stack.

I decided to mimic the x87 and the F18 (EDIT: modulo the latter’s two TOS cache registers): pushing/popping doesn’t cause any data movement. Instead, like the drawing below shows, they decrement/increment a modular counter that points to the top of the stack (TOS). That would still be slow in software (most ISAs can’t index registers). The key is that the counter can’t take too many values: only 8 values if there are 8 slots in the stack. Stack VMs already duplicate primops for performance reasons (e.g., to help the BTB by spreading out execution of the same primitive between multiple addresses), so it seems reasonable to specialise primitives for all 8 values the stack counter can take.
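The modular counter can be modelled in a few lines of portable Common Lisp. This is only a sketch of the idea, with an array standing in for the eight registers; `ring-stack`, `ring-push`, `ring-pop` and `*slot-count*` are made-up names, not part of the VM described here.

```lisp
(defparameter *slot-count* 8)

;; SLOTS stands in for the register file; TOP is the modular counter.
(defstruct ring-stack
  (slots (make-array *slot-count* :initial-element 0))
  (top 0))

;; Pushing decrements the counter (mod 8), then writes the new TOS.
;; No other slot is touched, so no data moves.
(defun ring-push (stack value)
  (setf (ring-stack-top stack)
        (mod (1- (ring-stack-top stack)) *slot-count*))
  (setf (aref (ring-stack-slots stack) (ring-stack-top stack)) value)
  stack)

;; Popping reads the TOS, then increments the counter (mod 8).
(defun ring-pop (stack)
  (prog1 (aref (ring-stack-slots stack) (ring-stack-top stack))
    (setf (ring-stack-top stack)
          (mod (1+ (ring-stack-top stack)) *slot-count*))))

;; Nine pushes into eight slots: the ninth silently overwrites the first.
(let ((s (make-ring-stack)))
  (loop for i from 1 to 9 do (ring-push s i))
  (loop repeat 9 collect (ring-pop s)))
;; => (9 8 7 6 5 4 3 2 9)
```

There is no overflow or underflow check: the counter just wraps, which is the silent behaviour described for the F18 above.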

In a regular direct threaded VM, most primops would end with a code sequence that jumps to the next one (NEXT), something like

    add rsi, 8   ; increment virtual IP before jumping
    jmp [rsi-8]  ; jump to the address RSI previously pointed to

where rsi is the virtual instruction pointer, and VM instructions are simply pointers to the machine code for the relevant primitive.

I’ll make two changes to this sequence. I don’t like hardcoding addresses in bytecode, and 64 bits per virtual instruction is overly wasteful. Instead, I’ll encode offsets from the primop code block:

    mov eax, [rsi]
    add rsi, 4
    add rax, rdi
    jmp rax

where rdi is the base address for primops.
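As a rough illustration of the offset encoding, here is a sketch that stores a program as 32-bit offsets and resolves each one against a base address at dispatch time. The primop names, their offsets and the base address are all invented for the example.

```lisp
;; Hypothetical primops and their byte offsets within the code block.
(defparameter *primop-offsets* '((dup . 0) (swap . 64) (add . 128)))

;; Encode a program as a vector of 32-bit offsets (4 bytes per instruction)
;; instead of absolute 64-bit addresses (8 bytes per instruction).
(defun encode (program)
  (map '(vector (unsigned-byte 32))
       (lambda (op) (cdr (assoc op *primop-offsets*)))
       program))

;; What NEXT computes: the base of the primop block plus the stored offset.
(defun target-address (base code ip)
  (+ base (aref code ip)))

(let ((code (encode '(dup add swap))))
  (list (coerce code 'list)
        (target-address #x10000 code 1)))
;; => ((0 128 64) 65664)
```

Because the bytecode only holds offsets, it stays valid wherever the primop block ends up in memory; only the base register changes.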

I also need to dispatch based on the new value of the implicit stack counter. I decided to make the dispatch as easy as possible by storing the variants of each primop at regular intervals (e.g. one page). I rounded that up to 64 * 67 = 4288 bytes to minimise aliasing accidents. NEXT becomes something like

    mov eax, [rsi]
    add rsi, 4
    lea rax, [rax + rdi + variant_offset]
    jmp rax

The trick is that variant_offset = 4288 * stack_counter, and the stack counter is (usually) known when the primitive is compiled. If the stack is left as is, so is the counter; pushing a value decrements the counter and popping one increments it.
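The arithmetic is small enough to sketch directly. The helper names below are hypothetical; the only values taken from the text are the 4288-byte stride, the eight counter values, and the rule that a push decrements the counter while a pop increments it.

```lisp
(defparameter *variant-stride* (* 64 67)) ; 4288 bytes between variants

;; Offset of the variant block specialised for a given stack counter.
(defun variant-offset (stack-counter)
  (* *variant-stride* (mod stack-counter 8)))

;; Counter seen by the next primop, given how many values this one
;; pushes and pops: each push decrements, each pop increments.
(defun counter-after (stack-counter pushes pops)
  (mod (+ stack-counter pops (- pushes)) 8))

;; Address NEXT jumps to: base + primop offset + variant offset.
(defun next-target (base primop-offset stack-counter)
  (+ base primop-offset (variant-offset stack-counter)))

(variant-offset 3)                         ; => 12864
(counter-after 0 1 0)                      ; => 7   (one push from counter 0)
(variant-offset (counter-after 0 1 0))     ; => 30016
(next-target 0 128 (counter-after 3 0 2))  ; => 21568 (two pops from counter 3)
```

Since the counter after a primop is fixed when that primop's variant is generated, `variant-offset` folds into a constant displacement in the `lea`.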

That seems reasonable enough. Let’s see if we can make it work.

Preliminaries

I want to explore a problem for which I’ll emit a lot of repetitive machine code. SLIME’s REPL and SBCL’s assembler are perfect for the task! (I hope it’s clear that I’m using unsupported internals; if it breaks, you keep the pieces.)

The basic design of the VM is:

r8-r15: stack slots (32 bits);

rsi: base address for machine code primitives;

rdi: virtual instruction pointer (points to the next instruction);

rax,rbx,rcx,rdx: scratch registers;

rsp: (virtual) return stack pointer.

    (import '(sb-assem:inst sb-vm::make-ea)) ; we'll use these two a lot

    ;; The backing store for our stack
    (defvar *stack*
      (make-array 8 :initial-contents (list sb-vm::r8d-tn sb-vm::r9d-tn
                                            sb-vm::r10d-tn sb-vm::r11d-tn
                                            sb-vm::r12d-tn sb-vm::r13d-tn
                                            sb-vm::r14d-tn sb-vm::r15d-tn)))

    ;; The _primop-generation-time_ stack pointer
    (defvar *stack-pointer*)

    ;; (@ 0) returns the (current) register for TOS, (@ 1) returns
    ;; the one just below, etc.
    (defun @ (i)
      (aref *stack* (mod (+ i...
