The kernel patch that almost broke our entire fleet | Geocodio
Code and Coordinates
engineering at Geocodio
A new Hetzner server kept crashing every 10 minutes to 16 hours. Three suspects, one of them ruled out by a sibling server, and a trap waiting for the rest of our fleet.
May 2026
By Mathias Hansen
Engineering
Infrastructure
Security
Linux
Last week I spent three days convinced one of our newest Hetzner servers had a bad memory stick. It didn't.
It was a Linux kernel regression from a rushed security patch. It hadn't reached the rest of our fleet yet, but the next time we ran routine maintenance on any of those servers, it would have. The bug was sitting in the Debian security archive, signed and ready, waiting for us to come pull it down.
A quick clarification before we go deeper
This story is about our self-serve infrastructure, which runs on Hetzner bare-metal servers. Our enterprise tier runs on AWS with an entirely separate set of provisioning, patching, and validation processes -- and was unaffected throughout this incident.
The setup
We provisioned a new app server, app-N, on May 12. By the next morning it had gone down. Hetzner Robot showed the server as healthy and online. Tailscale showed it as unreachable. SSH didn't connect. The Hetzner status page said everything was fine.
A hard reset brought it back. Then 14 hours later, it went down again. Same pattern.
The frustrating part: out of several hundred servers in our herd, only this one was misbehaving. (We treat our servers like cattle, not pets, and the herd is usually pretty uniform.) Same hardware (a 6-core, 12-thread AMD Ryzen 5 3600 bare-metal box from Hetzner), same Docker stack, same workload, same Debian version. Nothing about app-N looked different from its siblings.
Once I got SSH back, I pulled the kernel logs from the previous boot.
Three suspects
I had three candidates.
Suspect 1: Beyla
The first boot I dug into pointed right at one of them:
watchdog: BUG: soft lockup - CPU#6 stuck for 48s! [beyla:5115]
watchdog: BUG: soft lockup - CPU#6 stuck for 74s! [beyla:5115]
...
watchdog: BUG: soft lockup - CPU#6 stuck for 205s! [beyla:5115]
NMI watchdog: Watchdog detected hard LOCKUP on cpu 11
igb 0000:23:00.0 enp35s0: NETDEV WATCHDOG: transmit queue 0 timed out 33100 ms
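When a previous boot's log holds dozens of these lines, it can help to collapse them into one entry per stuck CPU and process. Here is a minimal sketch of that idea in Python. It is not the tooling we actually used: the function name `summarize_soft_lockups` and the `SAMPLE` text are made up for illustration, and the regex assumes the soft-lockup message format shown in the excerpt above.

```python
import re
from collections import defaultdict

# Matches soft-lockup lines in the format shown in the log excerpt, e.g.
#   watchdog: BUG: soft lockup - CPU#6 stuck for 48s! [beyla:5115]
# The greedy ".+" backtracks to the last colon, so process names that
# themselves contain colons are still split correctly from the PID.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(log_text):
    """Return {(cpu, process_name, pid): longest reported stall in seconds}."""
    worst = defaultdict(int)
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match is None:
            continue  # hard lockups, NIC timeouts, and unrelated lines are skipped
        key = (int(match["cpu"]), match["comm"], int(match["pid"]))
        worst[key] = max(worst[key], int(match["secs"]))
    return dict(worst)


# Hypothetical input: a few of the lines from the excerpt above.
SAMPLE = """\
watchdog: BUG: soft lockup - CPU#6 stuck for 48s! [beyla:5115]
watchdog: BUG: soft lockup - CPU#6 stuck for 74s! [beyla:5115]
watchdog: BUG: soft lockup - CPU#6 stuck for 205s! [beyla:5115]
NMI watchdog: Watchdog detected hard LOCKUP on cpu 11
"""

if __name__ == "__main__":
    for (cpu, comm, pid), secs in summarize_soft_lockups(SAMPLE).items():
        print(f"CPU#{cpu} {comm}[{pid}]: stuck for up to {secs}s")
```

Run against the sample, this reports a single entry: CPU 6, process beyla (PID 5115), stuck for up to 205 seconds. The sketch only covers soft lockups. The hard-lockup and NETDEV watchdog lines use different formats and would need their own patterns.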
Grafana Beyla -- an eBPF-based auto-instrumentation tool that attaches probes to running processes for observability -- had wedged a CPU for over three minutes. Once a CPU is stuck, the kernel can't service the network card's transmit queue. The NIC's watchdog times out. The host stops responding to the network -- including Tailscale -- while the Hetzner BMC keeps seeing the hardware as alive. That explained the "online to Hetzner, offline to Tailscale" symptom perfectly.
Case closed?
Not quite. Several hundred other servers in our fleet run the exact same beyla container -- the same grafana/beyla:3.0.0 image, identical sha256, on identical hardware, with the same nginx workload underneath it. None of...