Zero-copy in Go: sendfile, splice, and the cost of io.Copy

Segflow · Assel Meher @ Grafana Labs

A small file-serving service of mine slowed to a crawl one afternoon after a “harmless” middleware change. CPU on the server box doubled, throughput roughly halved. The diff was a single line: instead of handing a *os.File to io.Copy, somebody had wrapped it in a tiny logging reader to count bytes. That one wrap quietly turned off sendfile(2).

This post is about that fast path: what Go does for you for free, how to see it actually fire, and the surprisingly easy ways to lose it.

The setup

- Linux 6.6 / Ubuntu 24.04 (WSL2), AMD Ryzen 5 9600X, 16 GiB RAM
- Go 1.22.12
- 512 MiB random-bytes file, page cache warm

Every benchmark below serves the same big.bin file over plain TCP to a Go client on the same machine. The server is pinned to CPU 0 and the client to CPU 1, so we can read /usr/bin/time server-side and compare apples to apples. Syscall counts come from a vanilla strace -c -e trace=read,write,sendfile,splice.

What sendfile actually does

A normal “send this file” looks like this:

    disk -> page cache -> read() into user buffer -> write() into socket buffer -> NIC
                          ^ copy 1                   ^ copy 2

sendfile(2) collapses those two copies into one in-kernel transfer:

    disk -> page cache --(sendfile)--> socket buffer -> NIC
                         ^ no userspace round trip

No read, no write, no 32 KiB buffer bouncing through your address space. The kernel just splices page-cache pages straight into the socket’s send queue.

For socket-to-socket forwarding the equivalent is splice(2), which moves bytes through a kernel pipe without ever materialising them in user memory.

You don’t call either of these directly in Go. The standard library does it for you, when it can.

The fast path

The net package gives *net.TCPConn a ReadFrom method. When you write

    io.Copy(conn, f)

io.Copy looks for a shortcut before it falls back to its own read/write loop. Among other things, it checks whether the destination implements io.ReaderFrom. A *net.TCPConn does, so the call gets dispatched to its ReadFrom. That method’s first job is to look at the source: is it a *os.File? Is it an *io.LimitedReader wrapping a *os.File? If yes, it calls into internal/poll.SendFile, which loops over sendfile(2) until the file is drained.

The whole detection chain lives in two files: net/sendfile_linux.go and os/zero_copy_linux.go. It is roughly:

    // (simplified, in net/sendfile_linux.go)
    lr, ok := r.(*io.LimitedReader)
    if ok {
        remain, r = lr.N, lr.R
    }
    f, ok := r.(*os.File)
    if !ok {
        return 0, nil, false // fall back
    }
    // ... sendfile loop ...

Two type assertions and a syscall loop. That’s the whole thing.

Three handlers, one file

Here are the three reader shapes I want to compare. All three serve the same 512 MiB file over plain TCP. The only difference is what gets passed to io.Copy.

    // raw: hand io.Copy a *os.File directly.
    _, _ = io.Copy(conn, f)

    // wrapped: hide *os.File behind a "just an io.Reader" struct.
    type justReader struct{ r io.Reader }

    func (j justReader) Read(p []byte) (int, error) { return j.r.Read(p) }

    _, _ = io.Copy(conn, justReader{r: f})

    // limit: wrap in *io.LimitedReader, the only wrapper the net package sniffs.
    _, _ = io.Copy(conn, io.LimitReader(f, fileSize))

A justReader does nothing. It’s the minimal example of a piece of middleware that “just wants to count bytes” or “just wants to inject a tracing span” or any other innocent reason to slip an io.Reader in front of the file. As far as the type system is concerned, the value is now an io.Reader, full stop. The net package’s type assertion to *os.File fails and the optimization is gone.

io.LimitReader looks just as wrappy, but the net package explicitly checks for *io.LimitedReader before giving up, unwraps it, and keeps going. So it preserves the fast path.

Three handlers, three different things happening under the hood.

Watching it with strace

Run each handler under strace -c -e trace=read,write,sendfile,splice, fire five 512 MiB transfers, and look at the summary.

raw (io.Copy(conn, f)):

    % time     seconds  usecs/call     calls    errors  syscall
     99.79    0.231981          78      2958       860  sendfile
      0.15    0.000359          51         7         1  write
      0.05    0.000126          18         7            read
    100.00    0.232466          78      2972       861  total

2,958 sendfile calls, 7 reads, 7 writes. The reads and writes are accept/setup chatter, not file data. The 860 “errors” are EAGAIN returns where the socket buffer was full and the runtime poller bounced back, which is normal under sendfile.

wrapped (io.Copy(conn, justReader{f})):

    % time     seconds  usecs/call     calls    errors  syscall
     56.67    3.202339          48     65546         3  write
     43.33    2.448353          37     65547            read
    100.00    5.650692          43    131093         3  total

Zero sendfile. ~131 thousand combined read and write syscalls, bouncing the data through a userspace buffer in 32 KiB chunks. That 32 KiB number is the default in io.copyBuffer. The time strace attributes to syscalls is roughly 24x that of the fast path (5.65 s versus 0.23 s).

A CPU profile of the wrapped handler under load makes the call chain obvious:

Of 1,670 CPU samples collected...
