Performance Improvements in Libffi

libffi is a function call interpreter. You hand it a description of a function’s signature at runtime, and it works out, on the spot, how to place each argument and make the call. It interprets the calling convention the way a bytecode VM interprets instructions. Nothing is compiled ahead of time, because the whole point is that you don’t know the signature ahead of time.

An interpreter is not what you reach for when you want speed. That is what JIT compilation is for, and some systems choose it instead. A runtime can JIT-compile a bespoke call stub for each signature, native code that drops the arguments into registers and jumps, with no interpretation left at runtime. It’s quicker, but it works by generating code at runtime into memory that’s both writable and executable, which is exactly what modern systems are trying to stamp out.

So libffi stays an interpreter, on purpose. The question I set out to answer was how much faster it could get that way, by making better use of what it already knows instead of generating code at runtime or mapping any page writable and executable.

The waste

When you call a function through libffi, the work splits across two places. ffi_prep_cif runs once per signature. It classifies the whole thing, but it keeps only two results: the size of the stack frame the call will need, and a small code for how the return value comes back. The frame size has to be known before the call is built, because any argument that doesn’t fit in a register spills to the stack, and that space is reserved up front. The return code is for afterward, because the result comes back in rax, or xmm0, or memory depending on the type, and something has to know where to read it from. Both are small and fixed-size, so they live in the ffi_cif.

What prep throws away is the part it spent most of its time on: where each individual argument goes.
So on every ffi_call, the marshalling code walks the argument list again and re-derives that placement from scratch before copying the values into place. For a three-argument call on x86-64 that’s around 650 instructions of bookkeeping, and it produces the identical answer every single time.

Most of those instructions aren’t moving argument bytes. They’re deciding where the bytes go. The x86-64 calling convention has genuine rules, and applying them to a single argument means walking its type, recursing into a struct’s fields and chasing the pointers in its type descriptor, sorting each 8-byte chunk into an integer or floating-point register class, and checking whether it still fits in the registers that are left or has to spill to the stack. That is branch-heavy, pointer-chasing work, the sort a CPU runs slowly, and it reruns on every call to compute a placement that never changes.

But function argument placement is a pure function of the signature. We can compute it once, remember it, and skip the work on every later call.

A plan

The fix is a “plan”: the placement compiled into a flat list of moves, a tiny bytecode for one signature. If ffi_call re-deriving the placement on every call is like interpreting a program by re-walking its syntax tree each time, the plan is the compiled bytecode: the tree-walk happens once, and every later call just runs the flat list.

build_plan walks the argument types once, classifies each one the way the ABI rules say, and emits a move per piece: this 8-byte word goes in rdi, that 32-bit int gets sign-extended into rsi, this double lands in an SSE slot, that oversized thing spills to the stack. With the plan in hand, making the call is just running the moves. No re-classification.
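To make the idea concrete, here is a minimal sketch of what a plan builder could look like. It is not libffi's build_plan: the toy_ names, the type enum, and the move layout are all hypothetical, and it handles only scalar arguments (no structs, no unsigned types). The opcode names and the register limits (six integer registers, eight SSE registers on System V x86-64) are the only parts taken from the text and the ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified type kinds; real libffi walks ffi_type descriptors. */
typedef enum { T_PTR, T_INT64, T_INT32, T_INT16, T_INT8, T_DOUBLE, T_FLOAT } toy_type;

/* Opcode names follow the article; the struct layout is an assumption. */
typedef enum { OP_GP64, OP_SE8, OP_SE16, OP_SE32, OP_SSE64, OP_SSE32, OP_STACK } toy_op;

typedef struct {
    toy_op  op;
    uint8_t arg;   /* index into avalue[] */
    uint8_t slot;  /* register index within its class, or stack slot for OP_STACK */
} toy_move;

enum { MAX_GP = 6, MAX_SSE = 8 };  /* rdi..r9 and xmm0..xmm7 */

/* Classify each argument once and emit one move per argument.
   Returns the number of moves written to out (always nargs in this toy). */
size_t toy_build_plan(const toy_type *types, size_t nargs, toy_move *out)
{
    unsigned gp = 0, sse = 0, stack = 0;

    for (size_t i = 0; i < nargs; i++) {
        toy_move m = { OP_STACK, (uint8_t)i, 0 };
        int is_fp = (types[i] == T_DOUBLE || types[i] == T_FLOAT);

        if (is_fp && sse < MAX_SSE) {
            m.op = (types[i] == T_DOUBLE) ? OP_SSE64 : OP_SSE32;
            m.slot = (uint8_t)sse++;
        } else if (!is_fp && gp < MAX_GP) {
            switch (types[i]) {
            case T_INT8:  m.op = OP_SE8;  break;
            case T_INT16: m.op = OP_SE16; break;
            case T_INT32: m.op = OP_SE32; break;
            default:      m.op = OP_GP64; break;  /* pointers and 64-bit ints */
            }
            m.slot = (uint8_t)gp++;
        } else {
            m.slot = (uint8_t)stack++;  /* out of registers: spill */
        }
        out[i] = m;
    }
    return nargs;
}
```

The builder runs once per signature; the resulting array is the only thing later calls need to consult.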

The opcodes are deliberately dumb. GP64 copies a word into a general register; SE8/SE16/SE32 sign-extend a narrow int; SSE64/SSE32 move a float; STACK memcpys a spilled argument. A three-argument call compiles to three or four of them. Here’s what two real signatures turn into:

    long (void *, void *, void *)    long (void *, int, void *)
    GP64 avalue[0] -> rdi            GP64 avalue[0] -> rdi
    GP64 avalue[1] -> rsi            SE32 avalue[1] -> rsi (sign-extend)
    GP64 avalue[2] -> rdx            GP64 avalue[2] -> rdx
    => all GP64: thunk               => has an SE32: interpret
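Running such a plan can be sketched as a single switch in a loop. The code below is an illustration under assumptions, not libffi's interpreter: the toy_regs "register image", the toy_ names, and the rule that every spilled argument is exactly 8 bytes are all invented for the example, and the SSE32 case assumes a little-endian target such as x86-64.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Opcode names follow the article; everything else here is hypothetical. */
typedef enum { OP_GP64, OP_SE8, OP_SE16, OP_SE32, OP_SSE64, OP_SSE32, OP_STACK } toy_op;
typedef struct { toy_op op; uint8_t arg; uint8_t slot; } toy_move;

/* Stand-in for the intermediate register image filled before the call. */
typedef struct {
    uint64_t gp[6];     /* rdi, rsi, rdx, rcx, r8, r9 */
    uint64_t sse[8];    /* low 64 bits of xmm0..xmm7 */
    uint64_t stack[8];  /* toy spill area, one 8-byte slot per argument */
} toy_regs;

/* Execute the moves: no classification happens here, only copies. */
void toy_run_plan(const toy_move *plan, size_t n, void *const *avalue, toy_regs *r)
{
    for (size_t i = 0; i < n; i++) {
        const void *src = avalue[plan[i].arg];
        uint8_t slot = plan[i].slot;

        switch (plan[i].op) {
        case OP_GP64:
            memcpy(&r->gp[slot], src, 8);
            break;
        case OP_SE8:  { int8_t v;  memcpy(&v, src, sizeof v); r->gp[slot] = (uint64_t)(int64_t)v; break; }
        case OP_SE16: { int16_t v; memcpy(&v, src, sizeof v); r->gp[slot] = (uint64_t)(int64_t)v; break; }
        case OP_SE32: { int32_t v; memcpy(&v, src, sizeof v); r->gp[slot] = (uint64_t)(int64_t)v; break; }
        case OP_SSE64:
            memcpy(&r->sse[slot], src, 8);
            break;
        case OP_SSE32:
            r->sse[slot] = 0;
            memcpy(&r->sse[slot], src, 4);  /* low 32 bits on little-endian */
            break;
        case OP_STACK:
            memcpy(&r->stack[slot], src, 8);  /* toy: spilled args are 8 bytes */
            break;
        }
    }
}
```

For the right-hand signature in the table, an int argument holding -1 ends up as all ones in the second general-register slot, which is the sign-extension the SE32 opcode exists for.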

When every argument is a single 64-bit value in a general register, which is most pointer-passing code, the plan doesn’t even need the interpreter. It’s marked thunk-eligible, and a small hand-written thunk in .text loads the values straight from the argument array into the argument registers and calls. It skips the move loop, the intermediate register image, and the copying back and forth entirely. The call on the right keeps an int, so it needs the sign-extend, so it runs the move loop instead. The plan is plain data, and the thunk ships in the binary’s read-only text like any other function. Nothing is ever both writable and executable, the same property closures already get from static...
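The thunk path can be approximated in portable C. This is only a sketch under assumptions: the real thunk is described as hand-written and living in .text, whereas toy_thunk3 below is an ordinary C function that leaves register assignment to the compiler. The toy_ names, the three-argument limit, and the sample callee toy_sum3 are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Opcode names follow the article; everything else here is hypothetical. */
typedef enum { OP_GP64, OP_SE8, OP_SE16, OP_SE32, OP_SSE64, OP_SSE32, OP_STACK } toy_op;
typedef struct { toy_op op; uint8_t arg; uint8_t slot; } toy_move;

/* A plan qualifies for the thunk when every move is a plain 64-bit
   general-register copy and nothing spills past the six GP registers. */
int toy_thunk_eligible(const toy_move *plan, size_t n)
{
    if (n > 6)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (plan[i].op != OP_GP64)
            return 0;
    return 1;
}

typedef uint64_t (*toy_fn3)(uint64_t, uint64_t, uint64_t);

/* C stand-in for a three-argument thunk: load each value straight from
   the argument array and call, with no move loop and no register image. */
uint64_t toy_thunk3(toy_fn3 fn, void *const *avalue)
{
    uint64_t a, b, c;
    memcpy(&a, avalue[0], sizeof a);
    memcpy(&b, avalue[1], sizeof b);
    memcpy(&c, avalue[2], sizeof c);
    return fn(a, b, c);
}

/* Sample callee used only to exercise the thunk. */
uint64_t toy_sum3(uint64_t a, uint64_t b, uint64_t c)
{
    return a + b + c;
}
```

A dispatcher would check toy_thunk_eligible once, when the plan is built, and store the result alongside the plan, so the per-call path is a single branch followed by either the thunk or the move loop.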
