z386: An Open-Source 80386 Built Around Original Microcode - Small Things Retro
This is the fifth installment of the 80386 series. The FPGA CPU is now far enough along to run real software, and this post is about how it works. z386 is a 386-class CPU built around the original Intel microcode, in the same spirit as z8086.
The core is not an instruction-by-instruction emulator in RTL. The goal is to recreate enough of the original machine that the recovered 386 control ROM can drive it. Today z386 boots DOS 6 and DOS 7, runs protected-mode DOS extenders such as DOS/4GW and DOS/32A, and plays games like Doom and Cannon Fodder. Here are some rough numbers against ao486:
| Metric | z386 | ao486 |
| --- | --- | --- |
| Lines of code (cloc) | 8K | 17.6K |
| ALUTs | 18K | 21K |
| Registers | 5K | 6.5K |
| BRAM | 116K | 131K |
| FPGA clock | 85 MHz | 90 MHz |
| 3DBench FPS | 34 | 43 |
| Doom (original) FPS, max details | 16.5 | 21.0 |
In current builds, z386 performs like a fast (~70 MHz) cached 386-class machine, or a low-end 486. It runs at a much higher clock than historical 386 CPUs, but with somewhat worse CPI (cycles per instruction). The current cache is a 16 KB, 4-way set-associative unified L1, chosen partly to keep the clock high. Real high-end 386 systems often used larger external caches, typically in the 32 KB to 128 KB range.
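The clock and frame-rate figures quoted against ao486 allow a rough per-clock comparison. The sketch below only divides numbers already given in this post. Frame rates also depend on the memory system and video path, so treat the result as an indication rather than a CPI measurement.

```python
# Figures copied from the z386 vs. ao486 comparison in this post.
z386 = {"clock_mhz": 85.0, "3dbench_fps": 34.0, "doom_fps": 16.5}
ao486 = {"clock_mhz": 90.0, "3dbench_fps": 43.0, "doom_fps": 21.0}


def relative_per_clock(bench):
    """z386 frames per MHz divided by ao486 frames per MHz."""
    z = z386[bench] / z386["clock_mhz"]
    a = ao486[bench] / ao486["clock_mhz"]
    return z / a


for bench in ("3dbench_fps", "doom_fps"):
    raw = z386[bench] / ao486[bench]  # ratio of frame rates, ignoring clock
    print(f"{bench}: raw {raw:.2f}, per-clock {relative_per_clock(bench):.2f}")
```

With these figures, z386 reaches roughly 0.79 of ao486's frame rate in both benchmarks, or about 0.83 to 0.84 once the clock difference is factored out.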
Doom II running on z386.
Much of this 386 microarchitecture archaeology has already been covered in the previous four posts: the multiplication/division datapath, the barrel shifter, protection and paging, and the memory pipeline. z386 tries to be both an educational reconstruction and a usable FPGA CPU. It keeps many 386-like structures: a 32-entry paging TLB, a barrel shifter shaped like the original, ROM/PLA-style decoding, the Protection PLA model, and most importantly the 37-bit-wide, 2,560-entry microcode ROM. At the same time, it uses FPGA-friendly shortcuts where they make sense, such as DSP blocks for multiplication and the small fast L1 cache.
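For a sense of scale, the microcode ROM dimensions quoted above work out as follows. This is only arithmetic on the two figures from the text. The address width shown is the minimum needed to index 2,560 entries, not a claim about how z386 or the original chip wires its microcode address.

```python
WORD_BITS = 37   # microcode word width, from the text
ENTRIES = 2560   # number of microcode words, from the text

total_bits = WORD_BITS * ENTRIES
total_bytes = total_bits / 8
# Smallest address width that can index every entry (an inference, not a source figure).
min_addr_bits = (ENTRIES - 1).bit_length()

print(f"{total_bits} bits, {total_bytes:.0f} bytes, {min_addr_bits}-bit address")
```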
In this post, I will fill in the rest of the design: instruction prefetch, decode, the microcode sequencer, cache design, testing, how z386 differs from ao486, and some lessons from the bring-up.
From z8086 to z386
A little background first. Last year I wrote z8086, an original-microcode-driven 8086, based on reenigne's disassembly work. That project showed that it was possible to build a working CPU around recovered microcode. Towards the end of the year, I learned that 80386 microcode had recently been extracted, and that reenigne and several others (credited at the end of this post) were working on a disassembly. They generously shared their work with me, and z386 started from there.
The 386 is a very different problem from the 8086. The instruction set is larger, the internal state is much richer, and the machine has to enforce protection, paging, privilege checks, and precise faults. More importantly, the 80386 micro-operations are denser and more contextual. If the 8086 microcode reads like a straightforward C program, the 386 microcode reads more like hand-tuned assembly: short, subtle, and full of assumptions about hidden hardware.
That puzzle took about four months of evenings and weekends. The result is not a perfect 386 yet, but it is now far enough along to run real protected-mode DOS software.
z386 - high-level view
At a high level, the 386 is organized around eight major units. z386 follows the same division closely enough that the original Intel block diagram is still a useful map.
The 80386 as eight cooperating units.
Source: Intel, The Intel 80386 - Architecture and Implementation, Figure 8.
The diagram maps quite well onto the 386 die shot, although the relative positions of the units are different.
The same eight-unit organization on the 80386 die.
Base image: Intel 80386 DX die, Wikimedia Commons.
Here is what those units do in z386:
1. Prefetch unit. Keeps a 16-byte code queue filled from memory. Branches, faults, interrupts, and segment changes can flush and restart it.
2. Decoder. Consumes instruction bytes, tracks prefixes, recognizes ModR/M and SIB forms, gathers immediates and displacements, and maps instructions to microcode entry points.
3. Microcode sequencer. Fetches expanded microcode words and handles jumps, delay slots, faults, and run-next-instruction behavior.
4. ALU and shifter. Implements arithmetic, logic, flags, bit operations, shifts, rotates, multiplication, and division support.
5. Segmentation unit. Computes logical-to-linear addresses, applies segment bases and limits, and stores the hidden descriptor-cache state.
6. Protection unit. Recreates the 386 Protection PLA behavior for selector and descriptor validation.
7. Paging unit. Handles TLB lookup, page walks, Accessed/Dirty updates, page faults, and the transition from linear to physical addresses.
8. BIU/cache/memory path. Connects CPU memory operations to paging, cache, SDRAM, ROM, I/O, and the surrounding PC system.
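To make the prefetch unit's role in the list above concrete, here is a toy software model of a 16-byte code queue that can be flushed and restarted. It is a sketch, not the z386 RTL: the class and method names are hypothetical, and the 4-byte fill width is an assumption chosen for illustration.

```python
from collections import deque


class PrefetchQueue:
    """Toy model of a 16-byte code queue (hypothetical, not the z386 RTL)."""

    CAPACITY = 16  # queue depth in bytes, from the text

    def __init__(self, memory, start):
        self.memory = memory      # any indexable sequence of byte values
        self.fetch_addr = start   # next address to fetch from
        self.queue = deque()

    def fill(self, max_bytes=4):
        """Fetch up to max_bytes (assumed fill width) while there is room."""
        n = min(max_bytes, self.CAPACITY - len(self.queue))
        for _ in range(n):
            self.queue.append(self.memory[self.fetch_addr])
            self.fetch_addr += 1
        return n

    def take(self, count):
        """Decoder side: consume count bytes, or None if too few are queued."""
        if len(self.queue) < count:
            return None
        return bytes(self.queue.popleft() for _ in range(count))

    def flush(self, target):
        """Branch, fault, interrupt, or segment change: drop bytes and restart."""
        self.queue.clear()
        self.fetch_addr = target


mem = bytes(range(64))
q = PrefetchQueue(mem, start=0)
while q.fill():          # keep fetching until the queue is full
    pass
first = q.take(3)        # decoder consumes a 3-byte instruction
q.flush(32)              # a taken branch redirects fetch to address 32
q.fill()
after_branch = q.take(2)
```

After the flush, the decoder sees bytes from the new target rather than the stale sequential bytes, which is the behavior the prefetch unit has to guarantee.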
This organization is quite different from the tidy pipelines...