Why we should get rid of average CPU utilization | Jeremy Theocharis<br>2026-05-14 · Based on a talk at Cloud Native Aachen Meetup, April 2026<br>A Go function in our application kept getting cancelled in production.<br>The function had a tight timeout. The same code ran fine in our development setups, in our CI and CD pipelines, and in every integration test we had. In production it would sometimes blow past the timeout and die with context deadline exceeded.<br>What made it worse was the state machine library [1] we used. When its context got cancelled, it wouldn’t recover on its own. It would crash and hang. We couldn’t reproduce it.<br>When we talked to users, they reported that CPU utilization looked fine.<br>It took us weeks to find the cause.<br>Why average CPU isn’t enough<br>One of the first things you learn when you work with computers is: when it gets slow, you open the task manager (or the equivalent) and check the CPU. If it’s high, you find the process responsible, stop it or fix it, and move on.<br>On Linux servers you have top, htop, and similar tools, all with a lot of other numbers on the screen. When I worked with them, I always just looked at the average CPU. Later in your career, you provision VMs. Maybe on VMware or Hyper-V on-prem. Maybe on AWS, Azure, or Hetzner. You pick a number of vCPUs. If you look closer, there is a more expensive option labeled something like “performance” or “dedicated.” I never really asked why. The option without it was just cheaper.<br>The instinct, taught to you by every tool, every vendor, and every dashboard, fails here.<br>Every tool shows you average CPU utilization, but none of them helps you interpret it. None of them tells you that CPU utilization doesn’t map linearly to how much capacity you have left. They just show you a percentage. 
The jump from 80% to 81% CPU utilization adds roughly 20× more wait time than the jump from 10% to 11% [2]. So even with 20% “headroom” at 80% utilization, latencies have already started climbing [3]:<br>CPU utilization | Wait for a 10 ms request<br>10% | ~11 ms<br>80% | ~50 ms<br>95% | ~200 ms<br>M/M/1 queueing model baseline. [3] In that model, the expected time a request spends in the system is its service time divided by (1 − utilization), so a 10 ms request takes about 10 / (1 − 0.8) = 50 ms at 80%.<br>Low utilization: a new request waits about one slot for the current job to finish.<br>Higher utilization: the same request waits three slots. Reality is more complex, of course (random arrivals, variable service times). The M/M/1 model captures the details.<br>Average CPU is the right metric for one question: are our CPUs utilized? That’s a cost question, and the IT department is right to ask it.<br>But running CPUs at high utilization only works if your workloads can wait. For latency-sensitive ones, higher utilization just means longer waits.<br>In our case, though, CPU utilization wasn’t even high. What we hit was throttling, a byproduct of cgroups, the Linux kernel feature that Docker and Kubernetes use to enforce per-container limits.<br>There was a resource limit on the container, set to 2000m. We read it as “two CPUs.” The kernel read it as a time budget. When the budget runs out, the container is throttled until the next period begins.<br>None of the tools we or our customer had in front of us showed this. That’s why it took us weeks to find the cause of the context deadline exceeded error. Every graph said everything was fine. Every user said everything was fine.<br>It was extremely surprising.<br>How CFS throttling actually works<br>Let’s assume you’re running a service in a container that processes HTTP requests, and you’ve set resource limits because that’s what every guide recommends [4].<br>Guaranteed QoS pod config: requests = limits = 2000m<br>kubectl top pod shows 800 millicores. Your Horizontal Pod Autoscaler (HPA) is configured to scale up at 80% utilization. 800m of a 2000m limit is 40%, far from the 80% target. Everything looks fine. 
Right?<br>No. Let’s look closer. Three numbers determine what happens here:<br>Your resource limit: 2000m.<br>The kernel’s CFS scheduling period: 100 ms by default. [5]<br>The host CPUs: 4 cores.<br>A 100 ms CFS scheduling period. With a 2000m limit, the container gets 2 × 100 ms = 200 ms of CPU time per period.<br>Why does it matter how many CPU cores the host has? Because this is where the abstraction leaks. The container can spend that 200 ms across every CPU core available on the node.<br>Now imagine an HTTP service. A resource-intensive request comes in and, running on all 4 cores in parallel, burns through the entire budget within 50 ms of wall-clock time.<br>A 50 ms burst across 4 cores exhausts the 200 ms budget (4 × 50 ms = 200 ms).<br>If a second request arrives right after the burst, it is throttled and has to wait up to 50 ms for the next scheduling period to begin.<br>The next 50 ms of wall clock are throttled.<br>Now imagine this repeats. If your load pattern is burst, idle, idle, idle, burst, your p99 latency can go through the roof. Yet every CPU graph still says everything is...