Computed goto for efficient dispatch tables - Eli Bendersky's website
Toggle navigation
Eli Bendersky's website
About
Projects
Archives
Recently, while idly browsing through the source code of Python, I came upon an interesting comment in the bytecode VM implementation (Python/ceval.c) about using the computed gotos extension of GCC [1]. Driven by curiosity, I decided to code a simple example to evaluate the difference between using a computed goto and a traditional switch statement for a simple VM. This post is a summary of my findings.
Defining a simple bytecode VM
First let's make clear what I mean by a "VM" in this context - a Bytecode Interpreter. Simply put, it's a loop that chugs through a sequence of instructions, executing them one by one.
Using Python's 2000-line strong (a bunch of supporting macros not included) PyEval_EvalFrameEx as an example wouldn't be very educational. Therefore, I'll define a tiny VM whose only state is an integer and has a few instructions for manipulating it. While simplistic, the general structure of this VM is very similar to real-world VMs. This VM is so basic that the best way to explain it is just to show its implementation:
#define OP_HALT 0x0<br>#define OP_INC 0x1<br>#define OP_DEC 0x2<br>#define OP_MUL2 0x3<br>#define OP_DIV2 0x4<br>#define OP_ADD7 0x5<br>#define OP_NEG 0x6
int interp_switch(unsigned char* code, int initval) {<br>int pc = 0;<br>int val = initval;
while (1) {<br>switch (code[pc++]) {<br>case OP_HALT:<br>return val;<br>case OP_INC:<br>val++;<br>break;<br>case OP_DEC:<br>val--;<br>break;<br>case OP_MUL2:<br>val *= 2;<br>break;<br>case OP_DIV2:<br>val /= 2;<br>break;<br>case OP_ADD7:<br>val += 7;<br>break;<br>case OP_NEG:<br>val = -val;<br>break;<br>default:<br>return val;
Note that this is perfectly "standard" C. An endless loop goes through the instruction stream and a switch statement chooses what to do based on the instruction opcode. In this example the control is always linear (pc only advances by 1 between instructions), but it would not be hard to extend this with flow-control instructions that modify pc in less trivial ways.
The switch statement should be implemented very efficiently by C compilers - the condition serves as an offset into a lookup table that says where to jump next. However, it turns out that there's a popular GCC extension that allows the compiler to generate even faster code.
Computed gotos
I will cover the details of computed gotos very briefly. For more information, turn to the GCC docs or Google.
Computed gotos is basically a combination of two new features for C. The first is taking addresses of labels into a void*.
void* labeladdr = &&somelabel;<br>somelabel:<br>// code
The second is invoking goto on a variable expression instead of a compile-time-known label, i.e.:
void* table[]; // addresses<br>goto *table[pc];
As we'll see shortly, these two features, when combined, can facilitate an interesting alternative implementation of the main VM loop.
To anyone with a bit of experience with assembly language programming, the computed goto immediately makes sense because it just exposes a common instruction that most modern CPU architectures have - jump through a register (aka. indirect jump).
The simple VM implemented with a computed goto
Here's the same VM, this time implemented using a computed goto [2]:
int interp_cgoto(unsigned char* code, int initval) {<br>/* The indices of labels in the dispatch_table are the relevant opcodes<br>*/<br>static void* dispatch_table[] = {<br>&&do_halt, &&do_inc, &&do_dec, &&do_mul2,<br>&&do_div2, &&do_add7, &&do_neg};<br>#define DISPATCH() goto *dispatch_table[code[pc++]]
int pc = 0;<br>int val = initval;
DISPATCH();<br>while (1) {<br>do_halt:<br>return val;<br>do_inc:<br>val++;<br>DISPATCH();<br>do_dec:<br>val--;<br>DISPATCH();<br>do_mul2:<br>val *= 2;<br>DISPATCH();<br>do_div2:<br>val /= 2;<br>DISPATCH();<br>do_add7:<br>val += 7;<br>DISPATCH();<br>do_neg:<br>val = -val;<br>DISPATCH();
Benchmarking
I did some simple benchmarking with random opcodes and the goto version is 25% faster than the switch version. This, naturally, depends on the data and so the results can differ for real-world programs.
Comments inside the CPython implementation note that using computed goto made the Python VM 15-20% faster, which is also consistent with other numbers I've seen mentioned online.
Why is it faster?
Further down in the post you'll find two "bonus" sections that contain annotated disassembly of the two functions shown above, compiled at the -O3 optimization level with GCC. It's there for the real low-level buffs among my readers, and as a future reference for myself. Here I aim to explain why the computed goto code is faster at a bit of a higher level, so if you feel there are not enough details, go over the disassembly in the bonus sections.
The computed goto version is faster because of two reasons:
The switch does a bit more per iteration because of bounds checking.
The effects of hardware branch prediction.
Doing less per iteration
If you examine the disassembly of the switch version, you'll see that it does the following per...