Interlude: Using the Index Registers Effectively on the Z80 | Bumbershoot Software
Working on these Spectrum projects has meant more practice with Z80 assembly along with everything else, and while I do think I’m basically at parity between Z80 and 6502 now in terms of my skills, the grind of actually doing things with it to get practical experience is still very necessary. Implementing the line-drawing algorithm required me to stretch my skills in new directions and put in some real work on program design, on instruction-level micro-optimizations, and on seeing just how far I could push my preferred assembler.
This was great, as practice: I had to really sit down and grind through which options were available and when and how I could best use each one. I had to consider multiple ways of phrasing everything. I did in fact come up with some real uses in my own code for multiple idioms that were new to me. I also found that all of this stayed basically consonant with my big CPU comparison last month.
I still have no proper illustration for the act of studying and hand-optimizing assembly code, so once again, enjoy the main menu screen of EXApunks (Zachtronics, 2018).
But then, the punchline: in practice, as opposed to as practice, none of those idioms ended up being worthwhile for the actual line-drawing function. After using them to write the function in the first place and then testing them out, I started tuning performance and every single one got removed.
That’s a quirk of this function, though. Wrestling with the alternatives here lays out the space neatly, and even though there’s no final work to show, there’s still going to be some value in showing my work.
Moving Beyond the Z80’s Register Space
Our core problem here is that our computation requires more memory than we can fit into registers, so we need to keep some of our state in memory. The nature of the Z80 instruction set is such that we generally cannot work with data unless it’s in registers, so this is much more a case of stashing values we don’t need just now with the intent of recovering them later. (The alternative may be seen in the 8086 and the 68000, where you can use values from the stack nearly as freely as you can the values within your registers, as sources or destinations.) In my Z80 projects to date, I have generally been able to rely on one of four techniques for managing computations. From least to most complex:
The entire computation fits in the register bank. Any persistent data is stored in global variables accessed via absolute address. My implementation of the 16-bit Xorshift PRNG works like this.
As above, but I want the function to not trash certain registers. This works as above but there is now a function prologue and epilogue that PUSH and POP the registers I wish to preserve. My SG-1000 BIOS library does this a great deal.
As above, but not at function boundaries. We can use PUSH and POP anywhere to make room for temporary values and then restore the values those temporaries displaced. My LZ4 decompressor uses this technique to stash the lengths of the various runs in the compressed data and recover them as needed.
As above, but we have exactly two 16-bit temporary values and we alternate between them before restoring the values we displaced. The Z80’s EX (SP),HL instruction lets us perform this alternation without disrupting any other registers. The LZ4 decompressor uses this to juggle the compressed data pointer and the backreference pointer during backreference processing.
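The second and fourth techniques can be sketched in a few lines. This is a minimal illustration rather than code from any of the projects named above: the labels preserve_demo and swap_demo are hypothetical, and the register assignments on entry are assumptions made for the example.

```asm
; Technique 2: preserve registers across a function boundary.
preserve_demo:
        push    bc              ; prologue: save what the caller wants kept
        push    de
        ; ... body is free to clobber BC and DE here ...
        pop     de              ; epilogue: restore in reverse order
        pop     bc
        ret

; Technique 4: alternate between two 16-bit values with EX (SP),HL.
; Assumed on entry: HL holds the first value, DE holds the second.
swap_demo:
        push    de              ; park the second value on the stack
        ; ... work with the first value in HL ...
        ex      (sp),hl         ; HL <-> top of stack: HL is now the second value
        ; ... work with the second value in HL ...
        ex      (sp),hl         ; swap back: HL is the first value again
        pop     de              ; recover the (possibly updated) second value
        ret
```

The point of the second routine is that only HL and the top of the stack change during the juggling, so every other register stays available for the actual work.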
With Bresenham’s algorithm, we reach the final stage:
There is more data than can fit into registers and the values persist sufficiently that we cannot treat some of them as "temporaries". In this case we must set aside memory to hold the computation state and sync registers with memory as necessary. The Z80 instruction set grants us the most freedom of action if this block of memory is a collection of contiguous global variables at a fixed location in RAM.
We have actually already seen an example of "a collection of contiguous global variables at a fixed location in RAM": the ZX Spectrum’s system variables. The Spectrum gives itself even more flexibility by pointing IY at a fixed location from which all of them fall within its index range; for our own code, we may also do this with IX if we so desire.
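As a sketch of what that looks like in our own code, here is a small state block reached both by absolute address and through IX. Everything here is hypothetical: the block, its field names, and the offsets are invented for illustration, and the defb and equ directives assume an assembler that accepts that common syntax.

```asm
; Hypothetical state block: three contiguous bytes at a fixed address.
state:
state_x:   defb 0
state_y:   defb 0
state_err: defb 0

; Offsets from the start of the block, for indexed access.
OFS_X   equ 0
OFS_Y   equ 1
OFS_ERR equ 2

demo:
        ; Absolute addressing: for single bytes, only A can be loaded
        ; or stored this way.
        ld      a,(state_x)
        inc     a
        ld      (state_x),a

        ; Indexed addressing: any 8-bit register can be the source or
        ; destination, and HL is left alone, at the cost of larger and
        ; slower instructions.
        ld      ix,state
        ld      b,(ix+OFS_Y)
        inc     (ix+OFS_ERR)    ; read-modify-write directly in memory
        ret
```

Because an index displacement is a signed byte, a block addressed this way has to stay within -128 to +127 bytes of wherever the index register points.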
Ways to Access Memory
The Z80 has a dizzying array of ways to move data around and it isn’t always obvious which ones are more performant than others:
An 8-bit register transfer LD r8,r8 is 1 byte and 4 cycles.
An 8-bit immediate load LD r8,n is 2 bytes and 7 cycles.
An 8-bit memory load or store through a 16-bit pointer register LD r8,(HL), LD (HL),r8, LD A,(HL/DE/BC), or LD (HL/DE/BC),A is 1 byte and 7 cycles.
An 8-bit memory load or store through an index register LD r8,(Ir+d) or LD (Ir+d),r8 costs 3 bytes and 19 cycles.
An 8-bit memory load or store through a direct address LD A,(nn) or LD (nn),A...