System call instrumentation on Linux/x86‑64 using memory‑indirect calls, part I

Rambles around computer science

Diverting trains of thought, wasting precious time

Mon, 15 Jun 2026

System call instrumentation on Linux/x86-64 using memory-indirect calls (in vain?), part one

My libsystrap library provides a simple instrumentation of system calls in Linux x86-64 userland. However, its current implementation suffers a double-trap overhead: system calls become ud2, which generates a SIGILL trap. Then we run the system call itself from within the signal handler, causing a second trap and some interesting tricky cases.

There has been some interesting research in this space in recent years, including the Liteinst “instruction punning” paper, the closely related E9Patch paper (neither of which is specifically about system call instrumentation), the later “zpoline” paper (which definitely is), and some follow-ups that make the latter more robust (lazypoline, K23).

The core problem that all these approaches are solving is a pure accident of the Intel instruction encoding: all useful jump instructions are at least 5 bytes long, whereas often we want to patch smaller instructions, such as system call instructions which are all (essentially) two bytes long. So if you want to replace a system call with a jump, you have a problem.

The idea of instruction punning, simplifying horribly and specialising it to the system-call problem (it is more general), is that if we have an instruction sequence containing a two-byte system call (here using the syscall instruction, 0f 05)

... 0f 05 xx yy zz ...

then when we make it into a jump or call, we might be able to work with the bytes of the next instruction, since they form part of the relative jump offset. In fact we have one free byte to play with:

... e9 WW xx yy zz ...

i.e. we leave the xx, yy and zz bytes alone because they belong to the next instruction(s), but we can change WW. WW xx yy zz will be interpreted as a 32-bit displacement, and ideally we simply place some kind of trampoline code wherever that lands.

Unfortunately, with the machine being little-endian, WW is the least significant byte, so the jump target is fixed except for 256 bytes of wiggle room. It demands a statistical approach: as long as the high-order byte is not zero or very small, we have a good chance of jumping far enough away to land at some memory that is available to use. If not, we can fall back on a signal-generating option like ud2, or do something else. The E9Patch paper presents some head-twisting compound versions of instruction punning for increasing its coverage in such scenarios, without resorting to trapping approaches like ud2. Meanwhile, this scattered nature of trampolines will require a lot of virtual address space, roughly one page per patch site, but we can play virtual memory tricks to colocate multiple trampolines on the same physical page (the E9Patch tool also does this).
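To make the "256 bytes of wiggle room" concrete, here is a minimal sketch in C of the displacement arithmetic for a punned e9 jump. The names pun_target and pun_window are hypothetical, invented for this illustration; they are not taken from Liteinst, E9Patch or libsystrap. The sketch assumes the usual rel32 rule that the displacement is sign-extended and added to the address of the end of the 5-byte jump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Address reached by `e9 WW xx yy zz` placed at `site`.
 * next3 holds xx, yy, zz: the bytes after the original 2-byte syscall,
 * which the patch must leave unchanged. (Hypothetical helper.) */
uint64_t pun_target(uint64_t site, uint8_t ww, const uint8_t next3[3])
{
    /* Little-endian: WW is the least significant byte of the displacement. */
    uint32_t raw = (uint32_t)ww
                 | ((uint32_t)next3[0] << 8)
                 | ((uint32_t)next3[1] << 16)
                 | ((uint32_t)next3[2] << 24);
    int32_t disp;
    memcpy(&disp, &raw, sizeof disp);   /* reinterpret the bits as signed */
    /* rel32 is relative to the end of the 5-byte jump instruction. */
    return site + 5 + (uint64_t)(int64_t)disp;
}

/* Lowest and highest reachable targets over all 256 choices of WW. */
void pun_window(uint64_t site, const uint8_t next3[3],
                uint64_t *lo, uint64_t *hi)
{
    assert(lo != NULL && hi != NULL);
    *lo = pun_target(site, 0x00, next3);
    *hi = pun_target(site, 0xff, next3);
}
```

For a site at 0x401000 followed by the bytes 10 20 30, the window is the 256 addresses starting at 0x30602005. A patcher would need to find or map usable memory somewhere in that window, and fall back to another strategy if it cannot.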

The idea of zpoline is cleaner and does not rely on punning or statistical approaches. It's quite clever. We can always replace a 2-byte system call with

ff d0 call *%rax

... which will generate a call to a small nonnegative address, because %rax must be holding the system call number i.e. a small nonnegative integer. That's neat but it means you have to map some instructions at the very bottom page (address zero), which undoes the standard hardware-enforced protection against null pointer accesses. The paper suggests mitigating this by (1) using Intel memory protection keys to make this memory execute-only, and (2) catching “jump to null pointer” bugs by validating the return address against a bitmap or hash table recording the known patched system call sites. However, this is still non-ideal: many processors don't support memory protection keys, validating the return address takes time, and on Linux, mapping low memory requires system privileges. The approach also behaves unpredictably if buggy code invokes a system call with a high value in %rax, whereas the kernel would fail cleanly (with ENOSYS).
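The byte-level rewrite itself is tiny. The following is a sketch only, under several assumptions: patch_syscall_site is a hypothetical name, the offset of the syscall instruction is assumed to be already known from a proper disassembly (scanning for the bytes 0f 05 would also match them inside other instructions or data), and the function works on an ordinary writable buffer, not on live, write-protected program text. It is not the zpoline implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Overwrite the 2-byte `syscall` (0f 05) at code[off] with
 * `call *%rax` (ff d0). Returns 0 on success, or -1 if the site is
 * out of bounds or does not hold a syscall instruction.
 * (Hypothetical helper for illustration.) */
int patch_syscall_site(uint8_t *code, size_t len, size_t off)
{
    assert(code != NULL);
    if (len < 2 || off > len - 2)
        return -1;                      /* no room for a 2-byte instruction */
    if (code[off] != 0x0f || code[off + 1] != 0x05)
        return -1;                      /* not `syscall`: refuse to patch */
    code[off]     = 0xff;               /* opcode FF /2: indirect near call */
    code[off + 1] = 0xd0;               /* ModRM: register operand %rax */
    return 0;
}
```

Because both instructions are two bytes long, nothing after the patch site moves. At run time the call pushes the address of the following instruction and transfers control to the address equal to the system call number in %rax, which is why code must be mapped at the bottom of the address space.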

The zpoline work made me think: can we find similar tricks with different trade-offs by exploring other corners of the instruction encoding? In x86 I have always been fascinated by the segmentation features, so I was minded to explore there. All x86 processors, even 64-bit ones, always run with some form of segmentation permanently enabled. In protected mode, all memory accesses are first translated through one of two segment descriptor tables, global (system-wide) and local (typically per-process). These tables select the linear virtual address that is then pushed through the page tables, as a second layer of translation. Linux lets us modify the process's local descriptor table using the modify_ldt() system call. Could we find a 2-byte form that will indirect through this table to reach, somehow, our intended system call instrumentation?
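As background for what follows, the choice between the global and local table is made per access by a 16-bit segment selector: bits 15..3 index into the table, bit 2 (the table indicator) picks the GDT or the LDT, and bits 1..0 carry the requested privilege level. A minimal sketch of that encoding in C, with hypothetical helper names invented for this illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build a segment selector from a descriptor table index, a flag saying
 * whether it refers to the local (LDT) rather than the global (GDT) table,
 * and a requested privilege level 0..3. (Hypothetical helper.) */
uint16_t make_selector(uint16_t index, bool use_ldt, unsigned rpl)
{
    assert(index < 8192 && rpl < 4);    /* 13-bit index, 2-bit RPL */
    return (uint16_t)(((unsigned)index << 3) | (use_ldt ? 4u : 0u) | rpl);
}

/* Descriptor table index encoded in a selector. */
uint16_t selector_index(uint16_t sel)
{
    return (uint16_t)(sel >> 3);
}

/* True if the selector refers to the LDT, false for the GDT. */
bool selector_is_ldt(uint16_t sel)
{
    return (sel & 4u) != 0;
}
```

For instance, the selector 0x33 decodes as GDT index 6 at privilege level 3, while 0x7 names entry 0 of the LDT at privilege level 3. Entries installed with modify_ldt() are reached through selectors of the second kind.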

Spoiler: sort of, but not really as I hoped.

Nevertheless, I...
