Rambles around computer science
Diverting trains of thought, wasting precious time
Mon, 15 Jun 2026
System call instrumentation on Linux/x86-64 using memory-indirect calls (in vain?), part one
My libsystrap library provides simple instrumentation of system calls in Linux x86-64 userland. However, its current implementation suffers a double-trap overhead: system calls become ud2, which generates a SIGILL trap. Then we run the system call itself from within the signal handler, causing a second trap and some interesting tricky cases.
There has been some interesting research in this space in recent years, including the Liteinst “instruction punning” paper, the closely related E9Patch paper (though neither is specifically about system call instrumentation), later the “zpoline” paper (which definitely is), and some follow-ups for making the latter more robust (lazypoline, K23).
The core problem that all these approaches are solving is a pure accident of the Intel instruction encoding: all useful jump instructions are at least 5 bytes long, whereas often we want to patch smaller instructions, such as system call instructions which are all (essentially) two bytes long. So if you want to replace a system call with a jump, you have a problem.
The idea of instruction punning, simplifying horribly and specialising it to the system-call problem (it is more general), is that if we have an instruction sequence containing a two-byte system call (here using the syscall instruction, 0f 05)
... 0f 05 xx yy zz ...
then when we make it into a jump or call, we might be able to work with the bytes of the next instruction, since they form part of the relative jump offset. In fact we have one free byte to play with:
... e9 WW xx yy zz ...
i.e. we leave the xx, yy and zz bytes alone because they belong to the next instruction(s), but we can change WW. WW xx yy zz will be interpreted as a 32-bit displacement, and ideally we simply place some kind of trampoline code wherever that lands.
Unfortunately, with the machine being little-endian, WW is the least significant byte, so the jump target is fixed except for 256 bytes of wiggle room. It demands a statistical approach: as long as the high-order byte is not zero or very small, we have a good chance of jumping far enough away to land at some memory that is available to use. If not, we can fall back on a signal-generating option like ud2, or do something else. The E9Patch paper presents some head-twisting compound versions of instruction punning for increasing its coverage in such scenarios, without resorting to trapping approaches like ud2. Meanwhile, the scattered nature of these trampolines will require a lot of virtual address space, roughly one page per patch site, but we can play virtual memory tricks to colocate multiple trampolines on the same physical page (the E9Patch tool also does this).
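To make the 256 bytes of wiggle room concrete, here is a minimal sketch of the arithmetic in Python. It is only an illustration: the function name, the patch-site address 0x401000 and the three following bytes are all made up for the example, and none of this is taken from Liteinst or E9Patch.

```python
import struct

JMP_REL32_LEN = 5  # opcode e9 plus a 4-byte displacement


def punned_jump_targets(site, following):
    """Return (lowest, highest) jump target reachable by varying only WW.

    site      -- address of the first byte of the patched syscall (hypothetical)
    following -- the three bytes xx, yy, zz that must stay untouched
    """
    if len(following) != 3:
        raise ValueError("expected exactly three following bytes")
    targets = []
    for ww in range(256):
        # Little-endian: WW is the least significant byte of the displacement.
        (disp,) = struct.unpack("<i", bytes([ww]) + bytes(following))
        # A relative jump is measured from the end of the jump instruction.
        targets.append(site + JMP_REL32_LEN + disp)
    return min(targets), max(targets)


# Made-up example: a syscall at 0x401000 followed by the bytes 48 3d 00.
lo, hi = punned_jump_targets(0x401000, bytes([0x48, 0x3D, 0x00]))
print(hex(lo), hex(hi))
```

Whatever the following bytes are, the two ends of the range are always exactly 255 bytes apart. Where that window sits in the address space is dictated entirely by xx, yy and zz.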
The idea of zpoline is cleaner and does not rely on punning or statistical approaches. It's quite clever. We can always replace a 2-byte system call with
ff d0 call *%rax
... which will generate a call to a small nonnegative address, because %rax must be holding the system call number, i.e. a small nonnegative integer. That's neat but it means you have to map some instructions at the very bottom page (address zero), which undoes the standard hardware-enforced protection against null pointer accesses. The paper suggests mitigating this by (1) using Intel memory protection keys to make this memory execute-only, and (2) catching “jump to null pointer” bugs by validating the return address against a bitmap or hash table recording the known patched system call sites. However, this is still non-ideal: many processors don't support memory protection keys, validating the return address takes time, and on Linux, mapping low memory requires system privileges. The approach also behaves unpredictably if buggy code invokes a system call with a high value in %rax, whereas the kernel would fail cleanly (with ENOSYS).
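The rewrite and the return-address check can be modelled in a few lines of Python. This is a toy sketch, not zpoline's implementation: the function names are invented, a plain set stands in for the bitmap or hash table, and the offsets of the syscall instructions are assumed to be known already, since a naive scan for the bytes 0f 05 could hit the middle of some other instruction.

```python
SYSCALL = bytes([0x0F, 0x05])    # syscall
CALL_RAX = bytes([0xFF, 0xD0])   # call *%rax


def patch_syscalls(code, base, offsets):
    """Rewrite the syscall at each offset; return the expected return addresses.

    code    -- bytearray holding the text to patch (modified in place)
    base    -- address at which code is loaded (hypothetical)
    offsets -- offsets of instructions already known to be syscalls
    """
    return_addrs = set()
    for off in offsets:
        if bytes(code[off:off + 2]) != SYSCALL:
            raise ValueError(f"no syscall instruction at offset {off:#x}")
        code[off:off + 2] = CALL_RAX
        # The call pushes the address of the instruction that follows it.
        return_addrs.add(base + off + 2)
    return return_addrs


def came_from_patched_site(return_addr, return_addrs):
    """Tell a genuine patched call apart from a stray jump to a low address."""
    return return_addr in return_addrs


# mov $39,%eax; syscall; ret, loaded at a made-up address.
text = bytearray.fromhex("b8270000000f05c3")
sites = patch_syscalls(text, 0x401000, [5])
print(text.hex(), [hex(a) for a in sites])
```

The patch itself costs nothing at run time. The cost lies in the membership check, which has to happen on every instrumented call.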
The zpoline work made me think: can we find similar tricks with different trade-offs by exploring other corners of the instruction encoding? In x86 I have always been fascinated by the segmentation features, so I was minded to explore there. All x86 processors, even 64-bit ones, always run with some form of segmentation permanently enabled. In protected mode, all memory accesses are first translated through one of two segment descriptor tables, global (system-wide) and local (typically per-process). These tables select the linear virtual address that is then pushed through the page tables, as a second layer of translation. Linux lets us modify the process's local descriptor table using the modify_ldt() system call. Could we find a 2-byte form that will indirect through this table to reach, somehow, our intended system call instrumentation?
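As background on how one of the two tables gets chosen: a 16-bit segment selector carries a table index in bits 3 to 15, a table-indicator bit in bit 2 (0 for the global table, 1 for the local one) and a requested privilege level in bits 0 and 1. The sketch below just packs and unpacks those fields. The helper names are invented for illustration, and it does not touch modify_ldt() at all.

```python
def make_selector(index, use_ldt, rpl=3):
    """Pack a segment selector from its three fields."""
    if not 0 <= index < 8192:
        raise ValueError("descriptor index must fit in 13 bits")
    if not 0 <= rpl <= 3:
        raise ValueError("privilege level must be 0..3")
    return (index << 3) | (int(use_ldt) << 2) | rpl


def split_selector(selector):
    """Unpack a selector into (index, use_ldt, rpl)."""
    return selector >> 3, bool(selector & 0x4), selector & 0x3


# Entry 1 of the local table, requested from user mode (privilege level 3).
print(hex(make_selector(1, use_ldt=True)))
```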
Spoiler: sort of, but not really as I hoped.
Nevertheless, I...