The kernel patch that almost broke our fleet

A new Hetzner server kept crashing every 10 minutes to 16 hours. Three suspects, one of them ruled out by a sibling server, and a trap waiting for the rest of our fleet.

May 2026

By Mathias Hansen

Engineering

Infrastructure

Security

Linux

Last week I spent three days convinced one of our newest Hetzner servers had a bad memory stick. It didn't. It was a Linux kernel regression from a rushed security patch. It hadn't reached the rest of our fleet yet, but the next time we ran routine maintenance on any of those servers, it would have. The bug was sitting in the Debian security archive, signed and ready, waiting for us to come pull it down.

A quick clarification before we go deeper

This story is about our self-serve infrastructure, which runs on Hetzner bare-metal servers. Our enterprise tier runs on AWS with an entirely separate set of provisioning, patching, and validation processes -- and was unaffected throughout this incident.

The setup

We provisioned a new app server, app-N, on May 12. By the next morning it had gone down. Hetzner Robot showed the server as healthy and online. Tailscale showed it as unreachable. SSH didn't connect. The Hetzner status page said everything was fine. A hard reset brought it back. Then 14 hours later, it went down again. Same pattern.

The frustrating part: out of several hundred servers in our herd, only this one was misbehaving. (We treat our servers like cattle, not pets, and the herd is usually pretty uniform.) Same hardware (a 6-core, 12-thread AMD Ryzen 5 3600 bare-metal box from Hetzner), same Docker stack, same workload, same Debian version. Nothing about app-N looked different from its siblings.

Once I got SSH back, I pulled the kernel logs from the previous boot.

Three suspects

I had three candidates.

Suspect 1: Beyla

The first boot I dug into pointed right at one of them:

watchdog: BUG: soft lockup - CPU#6 stuck for 48s! [beyla:5115]
watchdog: BUG: soft lockup - CPU#6 stuck for 74s! [beyla:5115]
...
watchdog: BUG: soft lockup - CPU#6 stuck for 205s! [beyla:5115]
NMI watchdog: Watchdog detected hard LOCKUP on cpu 11
igb 0000:23:00.0 enp35s0: NETDEV WATCHDOG: transmit queue 0 timed out 33100 ms
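Soft-lockup reports like the ones above follow a fixed shape (CPU number, stall duration, then the process name and PID in brackets), so they are easy to summarize across boots or across servers. The following is a minimal Python sketch of that idea, not the tooling we actually used: the function name summarize_soft_lockups and the SAMPLE list are made up for illustration, and the pattern only covers the soft-lockup lines shown in this post.

```python
import re

# Matches kernel lines such as:
#   watchdog: BUG: soft lockup - CPU#6 stuck for 48s! [beyla:5115]
# The process name may itself contain colons (e.g. kworker/0:1), so the
# name group is greedy and the PID is whatever follows the last colon.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(lines):
    """Return {(cpu, process_name, pid): longest_reported_stall_in_seconds}.

    Hypothetical helper: lines that are not soft-lockup reports are ignored.
    """
    worst = {}
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match is None:
            continue
        key = (int(match["cpu"]), match["comm"], int(match["pid"]))
        worst[key] = max(worst.get(key, 0), int(match["secs"]))
    return worst


# Sample input copied from the log excerpt in this post.
SAMPLE = [
    "watchdog: BUG: soft lockup - CPU#6 stuck for 48s! [beyla:5115]",
    "watchdog: BUG: soft lockup - CPU#6 stuck for 74s! [beyla:5115]",
    "watchdog: BUG: soft lockup - CPU#6 stuck for 205s! [beyla:5115]",
    "NMI watchdog: Watchdog detected hard LOCKUP on cpu 11",
]

for (cpu, comm, pid), secs in sorted(summarize_soft_lockups(SAMPLE).items()):
    print(f"CPU#{cpu} {comm}[{pid}]: stuck for up to {secs}s")
```

On a host that keeps a persistent systemd journal (an assumption; this post doesn't say how the logs were pulled), the kernel messages from the previous boot can be read with journalctl -k -b -1 and fed to a function like this one line by line.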

Grafana Beyla -- an eBPF-based auto-instrumentation tool that attaches probes to running processes for observability -- had wedged a CPU for over three minutes. Once a CPU is stuck, the kernel can't service the network card's transmit queue. The NIC's watchdog times out. The host stops responding to the network -- including Tailscale -- while the Hetzner BMC keeps seeing the hardware as alive.

That explained the "online to Hetzner, offline to Tailscale" symptom perfectly. Case closed?

Not quite. Several hundred other servers in our fleet run the exact same beyla container -- the same grafana/beyla:3.0.0 image, identical sha256, on identical hardware, with the same nginx workload underneath it. None of...
