Finding a needle in a 4 GB haystack: from 0.75 GB/s to 49 GB/s in Go · SegflowSegflow<br>Assel Meher<br>@ Grafana Labs
Home<br>Posts<br>Resume
I had a 4 GiB file that’s almost entirely zeros, exactly one non-zero<br>int64 is hiding at offset Size - 8 (the last aligned slot). The task: find<br>that offset, as fast as possible, in Go on Linux.<br>It’s a deliberately silly problem. There’s no parsing, no indexing, no<br>cleverness on the algorithm side. The only thing it measures is how much data<br>we can pull through a CPU per second. Exactly the kind of micro-task that<br>exposes every layer of the stack: the Go runtime, the standard library, the<br>kernel, the page cache, the memory hierarchy, and SIMD, including Go 1.26’s<br>brand-new simd/archsimd package that lets you write AVX-512 in pure Go.<br>Starting from the most obvious os.ReadFile + for range we get 0.75<br>GB/s . Thirteen variants later we’re at 49 GB/s , a 66× speedup, and<br>we’ll know exactly which wall we hit and why.<br>The setup<br>Test box:<br>AMD Ryzen 5 9600X (Zen 5, 6c/12t, AVX2 and AVX-512)<br>15 GiB DDR5<br>WSL2 / Linux 6.6 / ext4 / NVMe SSD<br>Go 1.26<br>The haystack is exactly 4 GiB (4 bytes). I pwrite zeros over the<br>whole file once so the blocks are actually allocated (otherwise sparse-file<br>reads are free and meaningless), then plant a single fixed magic<br>int64 at offset Size - 8. The needle never moves , same position,<br>same value, across every run and every program invocation, so there is no<br>luck factor and the page cache state isn’t disturbed between iterations.<br>For every variant I run 5 timed iterations, drop the slowest and fastest, and<br>report the mean of the remaining three. All headline measurements are warm<br>cache (the 4 GiB file fits comfortably in RAM and is pre-read once before<br>each variant). At the end I’ll also show cold-cache numbers using<br>posix_fadvise(POSIX_FADV_DONTNEED).<br>The full source for every variant and the benchmark harness is in<br>browsable on GitHub.<br>V1: The naive one<br>Read the whole file into memory and walk every byte:<br>func (S) Search(path string) (int64, error) {<br>data, err := os.ReadFile(path)<br>if err != nil { return -1, err }<br>for i, b := range data {<br>if b != 0 {<br>return int64(i) &^ 7, nil<br>return -1, nil
os.ReadFile allocates a 4 GiB []byte and copy_to_user’s the whole file<br>into it. Then a tight Go loop walks 4 billion bytes looking for the first<br>non-zero one.
Execution time splits ~72% inside the scan loop and ~28% inside os.ReadFile (mostly runtime.makeslice zero-initializing 4 GiB of heap). The naive code is doing twice the work: allocate, copy, then scan.<br>Result: 749 MB/s. Most of that time is the allocation + kernel copy:<br>asking the Go runtime to grow the heap by 4 GiB per run is not free, and<br>once the working set blows past L3 the allocator pressure shows up clearly.<br>This is the baseline.<br>V2: bufio (the textbook answer)<br>Every “how do I read a big file in Go” Stack Overflow answer ever:<br>r := bufio.NewReaderSize(f, 116) // 64 KiB<br>for {<br>b, err := r.ReadByte()<br>...
~77% of CPU time is in bufio.(*Reader).ReadByte alone. Two billion calls, two billion bounds checks, two billion increments of an offset. The actual byte comparison is the cheap part.<br>Result: 755 MB/s, basically a tie with naive os.ReadFile (#v1).<br>bufio.Reader.ReadByte is one Go function call per byte. Four billion calls.<br>The compiler can’t inline through it and the cost of the call dwarfs the cost<br>of looking at a byte. naive os.ReadFile (#v1) paid a giant allocation tax up front; bufio pays a<br>giant function-call tax instead. Total wall time ends up almost identical.<br>Lesson: don’t pay function call overhead per byte if you can avoid it.<br>V3: Bigger chunks, scan 8 bytes at a time<br>Let’s stream the file in 1 MiB chunks into a reusable buffer and process each<br>chunk as []uint64:<br>buf := make([]byte, 120)<br>for {<br>n, err := io.ReadFull(f, buf)<br>words := unsafe.Slice((*uint64)(unsafe.Pointer(&buf[0])), n/8)<br>for i, w := range words {<br>if w != 0 { return off + int64(i)*8, nil }<br>off += int64(n)<br>if err != nil { break }
Two things changed:<br>One syscall per megabyte instead of one per byte. The kernel transfers<br>256 page-cache pages with a single copy_to_user.<br>8 bytes per iteration of the inner loop . Same number of branches and<br>compares as naive os.ReadFile (#v1) , but 8× the work per iteration.
About 60% of time is in internal/poll.(*FD).Read (the read syscall + copy_to_user) and 40% in the scan loop. We’ve cleanly amortized syscall overhead; the kernel copy is now the headline cost.<br>Result: 13.7 GB/s, an 18× jump. This is already pretty good. Most<br>people would stop here.<br>We won’t.<br>V4: mmap, the obvious next step<br>Why copy bytes from kernel space to user space when the kernel can just hand<br>us a window into its page cache?<br>data, _ := unix.Mmap(int(f.Fd()), 0, size, unix.PROT_READ, unix.MAP_SHARED)<br>defer unix.Munmap(data)<br>words := unsafe.Slice((*uint64)(unsafe.Pointer(&data[0])), size/8)<br>for i, w := range words {<br>if w != 0 { return...