Zero-copy in Go: sendfile, splice, and the cost of io.Copy


Segflow · Assel Meher @ Grafana Labs


A small file-serving service of mine slowed to a crawl one afternoon after a
“harmless” middleware change. CPU on the server box doubled, throughput
roughly halved. The diff was a single line: instead of handing a *os.File
to io.Copy, somebody had wrapped it in a tiny logging reader to count
bytes.

That one wrap quietly turned off sendfile(2).

This post is about that fast path: what Go does for you for free, how to see
it actually fire, and the surprisingly easy ways to lose it.

The setup

Linux 6.6 / Ubuntu 24.04 (WSL2), AMD Ryzen 5 9600X, 16 GiB RAM
Go 1.22.12
512 MiB random-bytes file, page cache warm

Every benchmark below serves the same big.bin file over plain TCP to a Go
client on the same machine. The server is pinned to CPU 0 and the client to
CPU 1, so we can read /usr/bin/time server-side and compare apples to apples.
Syscall counts come from a vanilla strace -c -e trace=read,write,sendfile,splice.

What sendfile actually does

A normal “send this file” looks like this:

disk -> page cache -> read() into user buffer -> write() into socket buffer -> NIC
                      ^ copy 1                   ^ copy 2

sendfile(2) collapses those two copies into one in-kernel transfer:

disk -> page cache --(sendfile)--> socket buffer -> NIC
                   ^ no userspace round trip

No read, no write, no 32 KiB buffer bouncing through your address space.
The kernel just splices page-cache pages straight into the socket’s send
queue. For socket-to-socket forwarding the equivalent is splice(2), which
moves bytes through a kernel pipe without ever materialising them in user
memory.

You don’t call either of these directly in Go. The standard library does it
for you, when it can.

The fast path

The net package gives *net.TCPConn a ReadFrom method. When you write

io.Copy(conn, f)

io.Copy checks whether the destination implements io.ReaderFrom. A
*net.TCPConn does, so the call gets dispatched to its ReadFrom. That
method’s first job is to look at the source: is it a *os.File? Is it an
*io.LimitedReader wrapping a *os.File? If yes, it calls into
internal/poll.SendFile, which loops over sendfile(2) until the file is
drained.

The whole detection chain lives in two files:
net/sendfile_linux.go and os/zero_copy_linux.go. It is roughly:

// (simplified, in net/sendfile_linux.go)
lr, ok := r.(*io.LimitedReader)
if ok { remain, r = lr.N, lr.R }
f, ok := r.(*os.File)
if !ok { return 0, nil, false } // fall back
// ... sendfile loop ...

Two type assertions and a syscall loop. That’s the whole thing.

Three handlers, one file

Here are the three reader shapes I want to compare. All three serve the
same 512 MiB file over plain TCP. The only difference is what gets passed
to io.Copy.

// raw: hand io.Copy a *os.File directly.
_, _ = io.Copy(conn, f)

// wrapped: hide *os.File behind a "just an io.Reader" struct.
type justReader struct{ r io.Reader }
func (j justReader) Read(p []byte) (int, error) { return j.r.Read(p) }
_, _ = io.Copy(conn, justReader{r: f})

// limit: wrap in *io.LimitedReader, the only wrapper the standard library sniffs.
_, _ = io.Copy(conn, io.LimitReader(f, fileSize))

A justReader does nothing. It’s the minimal example of a piece of
middleware that “just wants to count bytes” or “just wants to inject a
tracing span” or any other innocent reason to slip an io.Reader in front
of the file. As far as the type system is concerned, the value is now an
io.Reader, full stop. The standard library’s type assertion on *os.File
fails and the optimization is gone.

io.LimitReader looks just as wrappy, but the net package explicitly checks
for *io.LimitedReader before giving up, unwraps it, and keeps going. So
it preserves the fast path.

Three handlers, three different things happening under the hood.

Watching it with strace

Run each handler under strace -c -e trace=read,write,sendfile,splice,
fire five 512 MiB transfers, and look at the summary.

raw (io.Copy(conn, f)):

% time     seconds  usecs/call     calls    errors syscall
 99.79    0.231981          78      2958       860 sendfile
  0.15    0.000359          51         7         1 write
  0.05    0.000126          18         7           read
100.00    0.232466          78      2972       861 total

2,958 sendfile calls, 7 reads, 7 writes. The reads and writes
are accept/setup chatter, not file data. The 860 “errors” are EAGAIN
returns where the socket buffer was full and the runtime poller bounced
back, which is normal under sendfile.

wrapped (io.Copy(conn, justReader{f})):

% time     seconds  usecs/call     calls    errors syscall
 56.67    3.202339          48     65546         3 write
 43.33    2.448353          37     65547           read
100.00    5.650692          43    131093         3 total

Zero sendfile. ~131 thousand combined read and write syscalls, all
of them 32 KiB chunks bouncing the data through a userspace buffer. That
32 KiB number is the default in io.copyBuffer. The time strace charges to
syscalls is roughly 24x the fast path (5.65 s against 0.23 s).

A CPU profile of the wrapped handler under load makes the call chain
obvious:

Of 1,670 CPU samples collected...

