80386 Early Start Memory Access

80386 Early Start Memory Access - Small Things Retro

When Intel designed the 80386, they gave it a trick for hiding memory latency: Early Start. Instead of waiting for an instruction to reach its memory micro-op, the 386 begins the next instruction's address work (effective address, segment relocation, the bus cycle) in the last cycle of the current instruction. Intel credited it with about 9% of overall performance. It is also the source of the POPAD bug.

The z386 FPGA core I released in May ran the original 386 microcode but didn't have early start. Over the last month I added it along with a series of other optimizations, and z386 now reaches ao486-class performance:

    core              Doom (FPS)   3DBench   Landmark
    z386 0.1 (May)    16.6         33.7      147
    z386 0.4 (June)   23.0         44.5      170
    ao486             21.0         43.8      204

Doom (original, max details) went up ~39% (16.6 → 23.0), past ao486's 21.0, and the 16-bit 3DBench now edges past ao486 too. The board clock is unchanged from v0.1's 85 MHz, so the gains came entirely from cutting CPI, that is, doing more work per clock. Per-instruction, z386 went from well above the 386's cycle counts to at or below them on nearly everything:

Instruction timings: z386 0.1 → 0.4 vs the original 80386.
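As a quick check on the benchmark figures, the relative gains can be recomputed from the table. This is only a sketch of the arithmetic. It assumes the benchmarks are CPU-bound and run the same instruction stream on both versions, in which case, at a fixed clock, the throughput ratio approximates the inverse of the CPI ratio.

```python
# Benchmark figures copied from the table above.
results = {
    "z386 0.1": {"Doom": 16.6, "3DBench": 33.7, "Landmark": 147},
    "z386 0.4": {"Doom": 23.0, "3DBench": 44.5, "Landmark": 170},
    "ao486":    {"Doom": 21.0, "3DBench": 43.8, "Landmark": 204},
}

def speedup(new, old):
    """Relative gain of `new` over `old`, as a fraction (0.10 means +10%)."""
    return new / old - 1.0

for bench in ("Doom", "3DBench", "Landmark"):
    gain = speedup(results["z386 0.4"][bench], results["z386 0.1"][bench])
    vs_ao486 = results["z386 0.4"][bench] / results["ao486"][bench]
    print(f"{bench:9s} v0.1 -> v0.4: {gain:+.1%}   v0.4 / ao486: {vs_ao486:.2f}")
```

The Doom row reproduces the ~39% figure quoted in the text. Landmark is the one benchmark of the three where ao486 still leads.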

The memory pipeline post earlier in this series introduced Early Start as a concept. This post is about building it on an FPGA, plus the rest of the CPI work that got z386 to parity.

Early Start

Intel discussed Early Start in Slager's ICCD '86 paper, "Performance Optimizations of the 80386". The clue to how it works is in the microcode. Here is the entry for an ALU instruction that reads a memory operand (ADD reg, [mem]):

    ; ADD/OR/ADC/SBB/AND/SUB/XOR m,r
    04A EFLAGS -> FLAGSB FLGSBA RD 9
    04B DLY
    04C OPR_R -> TMPB WRITE_RESULT JMP UNL
    04D TMPB SRCREG +-&|^

The interesting thing is that the first micro-instruction, 04A, already issues RD: it starts the memory read. No micro-instruction before it computes the effective address, adds the segment base, or checks the limit. Address generation is implicit, done by hardwired logic. A concrete example makes this clearer:

    add eax, 16
    mov ebx, [eax+4]

In execution order, the microcode runs as in the table below. Line 023 runs the ALU (EAX + 16) and asserts RNI — "run next instruction" — so the machine is already committed to starting MOV r,m next. Line 024 writes the result back into EAX, and that same 024 cycle is the early-start window for the next instruction (the load):

    cycle  add eax, 16                  mov ebx, [eax+4]
    023:   EAX + 16 in the ALU, RNI
    024:   write EAX (= old EAX + 16)   early-start window: peek at the next instruction,
                                        forward the just-produced EAX, compute EA = EAX + 4,
                                        relocate, and issue RD
    019:                                RD microcode
    01A:                                DLY data arriving, write OPR_R
    01B:                                RNI
    01C:                                OPR_R -> EBX

This overlap starts the memory access at least one cycle earlier, cutting load/store latency. The subtlety is that the previous instruction's last micro-instruction may write back to a register, creating a data hazard. Here EAX is being written in that very cycle, so its new value isn't in the register file yet. The fix is the usual one: a forwarding network, so early-start sees the latest value. The 386DX's forwarding network had a corner-case bug that produced the famous POPAD bug: when POPAD is followed by an instruction using [EAX+...], the early-start machinery forwards the wrong value.
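The hazard is easy to reproduce in a small behavioral model. The sketch below is not the 386's or z386's logic. The function name `early_start_ea`, the `pending_write` tuple and the starting EAX value are made up for illustration. It only shows why the address computed during the write-back cycle is wrong unless the in-flight result is bypassed.

```python
MASK32 = 0xFFFFFFFF

def early_start_ea(regs, base, disp, pending_write=None, forward=True):
    """EA = base register + displacement, computed during the previous
    instruction's write-back cycle.

    pending_write is (reg_name, value) for a write that has not reached
    the register file yet, or None if there is no such write.
    """
    value = regs[base]
    if forward and pending_write is not None and pending_write[0] == base:
        value = pending_write[1]  # bypass: use the in-flight result instead
    return (value + disp) & MASK32

# Register file as seen during cycle 024: EAX still holds the old value.
regs = {"eax": 0x1000, "ebx": 0}
pending = ("eax", (regs["eax"] + 16) & MASK32)  # result of add eax, 16

stale = early_start_ea(regs, "eax", 4, pending, forward=False)
fresh = early_start_ea(regs, "eax", 4, pending, forward=True)
print(hex(stale), hex(fresh))  # 0x1004 0x1014
```

Without forwarding, the load for `mov ebx, [eax+4]` would be issued to the address derived from the old EAX.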

Another way to view early-start is coarse pipelining at the granularity of macro-instructions, where the last cycle of the previous instruction (RNI delay slot) is the write-back stage of that instruction, and it overlaps with the next instruction's first cycle, the early-start cycle.

Implementing Early Start

z386 tracks each instruction through a small lifecycle. The two events that matter here are i_pop, the cycle the instruction is pulled from the prefetch queue (which is the previous instruction's RNI delay slot), and i_first, the first cycle of its own microcode. i_pop is exactly the 386's early-start window, the second cycle (024) in the table above.

So early start, in z386, is: compute the effective address and linear address combinationally at i_pop, forwarding the in-flight register write. The decoder produces the base/index/displacement selectors, and:

wire [31:0] ea_early = calc_ea_core(fwd_onehot_gpr(ea_dec_base_sel_r), fwd_onehot_gpr(ea_dec_index_sel_r), ...);
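For readers less familiar with 386 addressing, here is a minimal sketch of the arithmetic this combinational path has to perform. It is an assumption-laden model, not a transcription of z386: `calc_ea` and `relocate` are hypothetical names, the real argument list of calc_ea_core is elided above, and only the expand-up segment case is modelled.

```python
MASK32 = 0xFFFFFFFF

def calc_ea(base, index, scale, disp):
    """32-bit effective address: base + index * scale + disp, wrapping at 2**32."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK32

def relocate(ea, seg_base, seg_limit, size):
    """Segment relocation for an expand-up segment.

    Checks that the last byte of the access lies within the limit, then
    adds the segment base to form the linear address. The real CPU raises
    a fault on a violation; this model raises ValueError instead.
    """
    if ea + size - 1 > seg_limit:
        raise ValueError("segment limit violation")
    return (seg_base + ea) & MASK32

# [eax+4] with the forwarded EAX = 0x1010, in a segment based at 0x20000.
ea = calc_ea(base=0x1010, index=0, scale=1, disp=4)
linear = relocate(ea, seg_base=0x20000, seg_limit=0xFFFF, size=4)
print(hex(ea), hex(linear))  # 0x1014 0x21014
```

In hardware all of this settles within the one i_pop cycle, which is why the register inputs have to come through the bypass rather than straight from the register file.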

fwd_onehot_gpr is the bypass. If the previous instruction's delay-slot writeback targets the EA's base or index register, it substitutes the writeback value (dest_value) for the register-file copy — handling byte, word, and dword writes separately, because a partial write only updates part of the register:

    FWD_BLO: fwd_onehot_gpr = {cur[31:8], dest_value[7:0]};   // AL
    FWD_W:   fwd_onehot_gpr = {cur[31:16], dest_value[15:0]}; // AX
    default: fwd_onehot_gpr = dest_value;                     // EAX
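The three cases shown can be mirrored in a few lines of software to check the merge behavior. This is a sketch of those cases only. The name `fwd_gpr` and the string width tags are invented for the example, and any other cases the real function handles (a high-byte write such as AH, for instance) are not shown in the excerpt and are not modelled.

```python
def fwd_gpr(cur, dest_value, width):
    """Merge an in-flight write into the 32-bit register-file value `cur`.

    width: "blo" for a low-byte write (AL), "w" for a 16-bit write (AX),
    "d" for a full 32-bit write (EAX).
    """
    if width == "blo":
        return (cur & 0xFFFFFF00) | (dest_value & 0xFF)
    if width == "w":
        return (cur & 0xFFFF0000) | (dest_value & 0xFFFF)
    if width == "d":
        return dest_value & 0xFFFFFFFF
    raise ValueError(f"unknown write width: {width}")

cur = 0x12345678
print(hex(fwd_gpr(cur, 0xAB, "blo")))      # 0x123456ab
print(hex(fwd_gpr(cur, 0xBEEF, "w")))      # 0x1234beef
print(hex(fwd_gpr(cur, 0xCAFEBABE, "d")))  # 0xcafebabe
```

In the byte and word cases the upper bits still come from the register file, so a partial write never clobbers the rest of the register in the forwarded value.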

Stack pointers get the same treatment through forwarded_esp, so a push right after an instruction that adjusts ESP still sees the new value. ea_early is then registered into ea_reg at...
