Finding a needle in a 4 GB haystack: from 0.75 GB/s to 49 GB/s in Go

Segflow · Assel Meher @ Grafana Labs


I had a 4 GiB file that’s almost entirely zeros. Exactly one non-zero int64 is hiding at offset Size - 8 (the last aligned slot). The task: find that offset, as fast as possible, in Go on Linux.

It’s a deliberately silly problem. There’s no parsing, no indexing, no cleverness on the algorithm side. The only thing it measures is how much data we can pull through a CPU per second. That makes it exactly the kind of micro-task that exposes every layer of the stack: the Go runtime, the standard library, the kernel, the page cache, the memory hierarchy, and SIMD, including Go 1.26’s brand-new simd/archsimd package that lets you write AVX-512 in pure Go.

Starting from the most obvious os.ReadFile + for range we get 0.75 GB/s. Thirteen variants later we’re at 49 GB/s, a 66× speedup, and we’ll know exactly which wall we hit and why.

The setup

Test box:

- AMD Ryzen 5 9600X (Zen 5, 6c/12t, AVX2 and AVX-512)
- 15 GiB DDR5
- WSL2 / Linux 6.6 / ext4 / NVMe SSD
- Go 1.26

The haystack is exactly 4 GiB (4,294,967,296 bytes). I pwrite zeros over the whole file once so the blocks are actually allocated (otherwise sparse-file reads are free and meaningless), then plant a single fixed magic int64 at offset Size - 8. The needle never moves: same position, same value, across every run and every program invocation, so there is no luck factor and the page cache state isn’t disturbed between iterations.

For every variant I run 5 timed iterations, drop the slowest and fastest, and report the mean of the remaining three. All headline measurements are warm cache (the 4 GiB file fits comfortably in RAM and is pre-read once before each variant). At the end I’ll also show cold-cache numbers using posix_fadvise(POSIX_FADV_DONTNEED).

The full source for every variant and the benchmark harness is browsable on GitHub.
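The timing rule from the setup (five runs, drop the fastest and the slowest, average the remaining three) can be sketched roughly as below. This is not the author's harness: the function names `trimmedMean` and `throughputGBps` are made up for illustration, the sample timings in `main` are invented, and the choice of decimal gigabytes (1e9 bytes) for GB/s is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// trimmedMean drops the single smallest and single largest sample and
// returns the mean of the rest. It needs at least three samples.
func trimmedMean(secs []float64) (float64, error) {
	if len(secs) < 3 {
		return 0, fmt.Errorf("need at least 3 samples, got %d", len(secs))
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), secs...)
	sort.Float64s(sorted)
	kept := sorted[1 : len(sorted)-1]
	var sum float64
	for _, s := range kept {
		sum += s
	}
	return sum / float64(len(kept)), nil
}

// throughputGBps converts "bytes scanned in secs seconds" to GB/s,
// assuming decimal gigabytes (1 GB = 1e9 bytes).
func throughputGBps(bytes int64, secs float64) float64 {
	return float64(bytes) / secs / 1e9
}

func main() {
	// Hypothetical wall-clock times for five iterations, in seconds.
	runs := []float64{5.9, 5.7, 5.8, 6.4, 5.6}
	mean, err := trimmedMean(runs)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	const haystack = int64(4) << 30 // 4 GiB
	fmt.Printf("trimmed mean %.2fs, %.2f GB/s\n", mean, throughputGBps(haystack, mean))
}
```

Dropping both extremes keeps a single noisy iteration (a scheduler hiccup, a stray page-cache eviction) from skewing the reported number.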
V1: The naive one

Read the whole file into memory and walk every byte:

```go
func (S) Search(path string) (int64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return -1, err
	}
	for i, b := range data {
		if b != 0 {
			// Round down to the start of the 8-byte slot holding this byte.
			return int64(i) &^ 7, nil
		}
	}
	return -1, nil
}
```

os.ReadFile allocates a 4 GiB []byte and copy_to_user’s the whole file into it. Then a tight Go loop walks 4 billion bytes looking for the first non-zero one.
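One detail in the V1 loop is easy to miss: the first non-zero byte is not necessarily the first byte of the int64 (its low-order bytes may happen to be zero), so the byte index is rounded down to an 8-byte boundary with `&^ 7`. A small sketch of that rounding, using a hypothetical helper name `slotOffset`:

```go
package main

import "fmt"

// slotOffset rounds a byte offset down to the start of the 8-byte-aligned
// int64 slot that contains it, by clearing the low three bits.
func slotOffset(byteOff int64) int64 {
	return byteOff &^ 7
}

func main() {
	// 4294967295 is the last byte of a 4 GiB file; its slot starts at Size - 8.
	for _, off := range []int64{0, 7, 8, 13, 4294967295} {
		fmt.Printf("byte %d lies in the slot starting at %d\n", off, slotOffset(off))
	}
}
```

`&^` is Go's bit-clear (AND NOT) operator, so `x &^ 7` is the same as `x - x%8` for non-negative offsets.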

Execution time splits ~72% inside the scan loop and ~28% inside os.ReadFile (mostly runtime.makeslice zero-initializing 4 GiB of heap). The naive code is doing twice the work: allocate, copy, then scan.

Result: 749 MB/s. A sizeable share of that time is the allocation + kernel copy: asking the Go runtime to grow the heap by 4 GiB per run is not free, and once the working set blows past L3 the allocator pressure shows up clearly. This is the baseline.

V2: bufio (the textbook answer)

Every “how do I read a big file in Go” Stack Overflow answer ever:

```go
r := bufio.NewReaderSize(f, 1<<16) // 64 KiB
for {
	b, err := r.ReadByte()
	...
}
```
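For reference, a complete version of that byte-at-a-time loop might look like the sketch below. It is a guess at the shape of V2, not the author's code: the names `searchByteAtATime` and `searchInMemory` are hypothetical, and it reads from any `io.Reader` so it can be tried on an in-memory buffer instead of a 4 GiB file.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// searchByteAtATime returns the 8-byte-aligned offset of the first non-zero
// byte read from r, or -1 if the input is all zeros.
func searchByteAtATime(r io.Reader) (int64, error) {
	br := bufio.NewReaderSize(r, 1<<16) // 64 KiB buffer
	var off int64
	for {
		// One method call per input byte: this is the cost V2 pays.
		b, err := br.ReadByte()
		if err == io.EOF {
			return -1, nil
		}
		if err != nil {
			return -1, err
		}
		if b != 0 {
			return off &^ 7, nil
		}
		off++
	}
}

// searchInMemory is a convenience wrapper for trying the scan on a byte slice.
func searchInMemory(data []byte) (int64, error) {
	return searchByteAtATime(bytes.NewReader(data))
}

func main() {
	data := make([]byte, 1<<20) // 1 MiB stand-in for the real haystack
	data[len(data)-8] = 0x2A    // hypothetical needle in the last slot
	off, err := searchInMemory(data)
	fmt.Println(off, err)
}
```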

~77% of CPU time is in bufio.(*Reader).ReadByte alone. Four billion calls, four billion bounds checks, four billion increments of an offset. The actual byte comparison is the cheap part.

Result: 755 MB/s, basically a tie with naive os.ReadFile (V1). bufio.Reader.ReadByte is one Go function call per byte. Four billion calls. The compiler can’t inline through it and the cost of the call dwarfs the cost of looking at a byte. Naive os.ReadFile (V1) paid a giant allocation tax up front; bufio pays a giant function-call tax instead. Total wall time ends up almost identical.

Lesson: don’t pay function call overhead per byte if you can avoid it.

V3: Bigger chunks, scan 8 bytes at a time

Let’s stream the file in 1 MiB chunks into a reusable buffer and process each chunk as []uint64:

```go
buf := make([]byte, 1<<20) // 1 MiB, reused for every chunk
for {
	n, err := io.ReadFull(f, buf)
	// View the filled part of the buffer as 64-bit words.
	words := unsafe.Slice((*uint64)(unsafe.Pointer(&buf[0])), n/8)
	for i, w := range words {
		if w != 0 {
			return off + int64(i)*8, nil
		}
	}
	off += int64(n)
	if err != nil {
		break
	}
}
```

Two things changed:

- One read syscall per megabyte of data, instead of one function call per byte as in V2. The kernel transfers 256 page-cache pages with a single copy_to_user.
- 8 bytes per iteration of the inner loop. Same number of branches and compares per iteration as naive os.ReadFile (V1), but 8× the work per iteration.
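Put together, the two changes can be sketched as a self-contained function. This is an illustration rather than the author's V3: `searchChunked` and `searchChunkedInMemory` are hypothetical names, the chunk size is a parameter so the code can be tried on small inputs, and the handling of a short final chunk (up to 7 trailing bytes) is an addition the original snippet does not show. It also assumes the buffer returned by `make` is 8-byte aligned, which holds for Go's allocator at these sizes.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"unsafe"
)

// searchChunked reads r in chunkSize-byte chunks into one reusable buffer and
// scans each chunk 8 bytes at a time. It returns the 8-byte-aligned offset of
// the first non-zero data, or -1 if everything is zero.
func searchChunked(r io.Reader, chunkSize int) (int64, error) {
	if chunkSize < 8 || chunkSize%8 != 0 {
		return -1, fmt.Errorf("chunk size %d is not a positive multiple of 8", chunkSize)
	}
	buf := make([]byte, chunkSize)
	var off int64 // file offset of buf[0]
	for {
		n, err := io.ReadFull(r, buf)
		// Reinterpret the filled part of the buffer as 64-bit words.
		words := unsafe.Slice((*uint64)(unsafe.Pointer(&buf[0])), n/8)
		for i, w := range words {
			if w != 0 {
				return off + int64(i)*8, nil
			}
		}
		// A short final chunk may leave up to 7 bytes past the last full word.
		for i := n &^ 7; i < n; i++ {
			if buf[i] != 0 {
				return (off + int64(i)) &^ 7, nil
			}
		}
		off += int64(n)
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return -1, nil // end of input, nothing found
		}
		if err != nil {
			return -1, err
		}
	}
}

// searchChunkedInMemory is a convenience wrapper for trying the scan on a slice.
func searchChunkedInMemory(data []byte, chunkSize int) (int64, error) {
	return searchChunked(bytes.NewReader(data), chunkSize)
}

func main() {
	data := make([]byte, 8<<20) // 8 MiB stand-in for the real haystack
	data[len(data)-8] = 0x2A    // hypothetical needle in the last slot
	off, err := searchChunkedInMemory(data, 1<<20)
	fmt.Println(off, err)
}
```

Reusing one buffer matters as much as the word-sized compare: the 1 MiB chunk is small enough to stay cache-resident, so the scan reads data the kernel copy has just written.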

About 60% of time is in internal/poll.(*FD).Read (the read syscall + copy_to_user) and 40% in the scan loop. We’ve cleanly amortized syscall overhead; the kernel copy is now the headline cost.

Result: 13.7 GB/s, an 18× jump. This is already pretty good. Most people would stop here. We won’t.

V4: mmap, the obvious next step

Why copy bytes from kernel space to user space when the kernel can just hand us a window into its page cache?

```go
data, _ := unix.Mmap(int(f.Fd()), 0, size, unix.PROT_READ, unix.MAP_SHARED)
defer unix.Munmap(data)
words := unsafe.Slice((*uint64)(unsafe.Pointer(&data[0])), size/8)
for i, w := range words {
	if w != 0 {
		return...
```
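Since the snippet above is cut off, here is a self-contained sketch of the same idea. It is an approximation, not the author's V4: it uses the standard library's `syscall.Mmap` instead of `golang.org/x/sys/unix` so it builds without extra modules (Linux and other Unix-like systems only), it checks the errors the snippet ignores, and it assumes the file size is a multiple of 8. `searchMmap` and `demoFile` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// searchMmap maps the file read-only and scans it 8 bytes at a time straight
// out of the page cache. It returns the offset of the first non-zero 64-bit
// word, or -1 if none is found. Trailing bytes past a multiple of 8 are ignored.
func searchMmap(path string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return -1, err
	}
	defer f.Close()
	fi, err := f.Stat()
	if err != nil {
		return -1, err
	}
	if fi.Size() == 0 {
		return -1, nil // mmap rejects zero-length mappings
	}
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()), syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return -1, err
	}
	defer syscall.Munmap(data)
	// Mappings are page-aligned, so viewing the bytes as uint64 words is safe.
	words := unsafe.Slice((*uint64)(unsafe.Pointer(&data[0])), len(data)/8)
	for i, w := range words {
		if w != 0 {
			return int64(i) * 8, nil
		}
	}
	return -1, nil
}

// demoFile writes a zero-filled temp file of the given size, with one non-zero
// byte at needleOff (skipped when needleOff is negative). It returns the path
// and a cleanup function that removes the file.
func demoFile(size, needleOff int) (string, func(), error) {
	data := make([]byte, size)
	if needleOff >= 0 {
		data[needleOff] = 0x2A
	}
	f, err := os.CreateTemp("", "haystack-*.bin")
	if err != nil {
		return "", nil, err
	}
	path := f.Name()
	cleanup := func() { os.Remove(path) }
	if _, err := f.Write(data); err != nil {
		f.Close()
		cleanup()
		return "", nil, err
	}
	if err := f.Close(); err != nil {
		cleanup()
		return "", nil, err
	}
	return path, cleanup, nil
}

func main() {
	const size = 4 << 20 // 4 MiB stand-in for the real haystack
	path, cleanup, err := demoFile(size, size-8)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer cleanup()
	off, err := searchMmap(path)
	fmt.Println(off, err)
}
```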
