The SRE Guide to Kubernetes Observability: RED vs. USE Methods

Mesut Oezdil

July 2, 2026
Linkerd

If you build dashboards for Kubernetes services, you have probably heard of RED and USE. The two are often taught as if they were two views of the same thing, but they’re not. Each method answers a different question, applies to a different layer of the stack, and is blind to exactly what the other sees. This blog post draws that line clearly, shows how a service mesh hands you RED for free the moment you inject a proxy, and walks through a real incident where you need both methods open at the same time to understand what is happening.

What is the RED method?

Tom Wilkie introduced the RED method for instrumenting request-serving services in 2015, while building tooling for container-based systems. He first showed it at a Prometheus meetup in London. The idea came out of his earlier years as an SRE at Google, where he relied on the Four Golden Signals. RED keeps three of those four signals (saturation was dropped) so that every service can be uniformly instrumented and on-call engineers can cover services they never wrote themselves.

All three are request-level signals that share a single scope, and each one tells you about a different kind of problem:

Rate: requests per second the service receives.
Errors: the fraction of those requests that fail.
Duration: how long requests take, expressed as percentiles (p50, p95, p99).

RED is scoped to services and to the perspective of whoever is calling them. It tells you whether a service is doing its job as its callers see it. In a service mesh, that caller is typically another service rather than the end user.

What is the USE method?

Brendan Gregg’s USE method applies to resources: CPU, memory, disk, network queues, and any other device that can be exhausted. He defined it in 2012 while working on low-level systems and kernel performance. The goal was to turn bottleneck hunting into a fixed checklist.
Instead of guessing which part of a machine is slow, you ask three questions:

Utilization: What fraction of capacity is in use?
Saturation: How much work is queued waiting for the resource?
Errors: What device-level faults are occurring?

That resource framing is why USE talks in terms of devices and capacity. It tells you whether a resource is healthy. The question it answers stops at the resource boundary: CPU, memory, disk, and queue depth. Whether the services running on that resource are serving callers correctly is a different question entirely.

Why they aren’t the same question

The confusion between RED and USE is unsurprising. Let’s say your team wants a service health dashboard. You pull node CPU, memory, and network utilization into Grafana, wire up alerts, and call it done. Alerts will fire when the node is stressed.

At 2 am, a customer reports slowness. The on-call engineer opens the node dashboard and sees that CPU is at 43% and memory looks good. There are no alerts, so the on-call engineer goes back to sleep.

What the node dashboard missed is that the frontend pod’s CPU limit was set too tight for its traffic volume. The pod was being throttled at 100% of its cgroup quota while the node as a whole sat under-utilized. Pod-level CPU throttling doesn’t show up in node-level utilization, because the two are different measurements of different things. At the same time, p99 latency on frontend had climbed from 196 ms to 288 ms, and throughput had dropped by a third. Success rate was still 100%, so the availability alert stayed quiet. A RED dashboard on frontend would have surfaced the latency rise on the first scrape interval. The node USE dashboard showed nothing wrong because, at the node level, nothing was wrong.

The reverse direction is equally blind. A cpu-hog pod running stress --cpu 4 against its cgroup limit shows a CPU throttle ratio approaching 1.0 in the cAdvisor metrics. It serves no requests.
A RED dashboard for that pod would have no data at all, because there is no traffic to measure. USE is the only lens that sees resource exhaustion in a workload that doesn’t serve requests.

One distinction is worth pointing out: RED errors are failed requests and USE errors are device faults. They share the letter E, but they measure different things at different layers.

Here’s how you should use these metrics: if you’re paged because something is slow or failing, open RED first. RED tells you which service is hurting and how badly. Then open USE to find out why. If you...
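
To make the "RED first, then USE" order concrete, here is a minimal Python sketch that replays the 2 am incident described earlier. Treat it as an illustration under stated assumptions, not as a real alerting setup. The `RedSnapshot` and `UseSnapshot` structures, the `triage` function, and every threshold are hypothetical. The request rates (300 and 200 per second) and the throttled-period counts are made-up values chosen to match "throughput had dropped by a third" and a pod throttled close to its quota. The only real-world names are the two cAdvisor counters mentioned in the comments, whose deltas would feed the throttle ratio.

```python
from dataclasses import dataclass

# All thresholds are illustrative assumptions, not recommendations.
MIN_SUCCESS_RATE = 0.99        # availability floor
P99_REGRESSION_FACTOR = 1.25   # flag p99 more than 25% above baseline
RATE_DROP_FACTOR = 0.80        # flag throughput below 80% of baseline
NODE_CPU_ALERT = 0.90          # node-level CPU utilization alert
THROTTLE_ALERT = 0.25          # fraction of CFS periods throttled


@dataclass
class RedSnapshot:
    """Request-level view of one service (hypothetical structure)."""
    requests_per_second: float
    success_rate: float        # fraction of successful requests, 0.0 to 1.0
    p99_latency_ms: float


@dataclass
class UseSnapshot:
    """Resource-level view of one pod and its node (hypothetical structure)."""
    node_cpu_utilization: float  # fraction of node CPU in use
    throttled_periods: int       # delta of container_cpu_cfs_throttled_periods_total
    total_periods: int           # delta of container_cpu_cfs_periods_total


def throttle_ratio(use: UseSnapshot) -> float:
    """Fraction of CFS periods in which the container was throttled."""
    if use.total_periods == 0:
        return 0.0
    return use.throttled_periods / use.total_periods


def triage(baseline: RedSnapshot, current: RedSnapshot, use: UseSnapshot) -> list:
    """Check RED first (which service hurts), then USE (why it hurts)."""
    findings = []
    if current.success_rate < MIN_SUCCESS_RATE:
        findings.append("RED: success rate below floor")
    if current.p99_latency_ms > baseline.p99_latency_ms * P99_REGRESSION_FACTOR:
        findings.append("RED: p99 latency regressed")
    if current.requests_per_second < baseline.requests_per_second * RATE_DROP_FACTOR:
        findings.append("RED: throughput dropped")
    if use.node_cpu_utilization > NODE_CPU_ALERT:
        findings.append("USE: node CPU saturated")
    if throttle_ratio(use) > THROTTLE_ALERT:
        findings.append("USE: pod CPU throttled")
    return findings


# Values from the incident, plus the hypothetical ones noted above.
baseline = RedSnapshot(requests_per_second=300.0, success_rate=1.0, p99_latency_ms=196.0)
current = RedSnapshot(requests_per_second=200.0, success_rate=1.0, p99_latency_ms=288.0)
pod_use = UseSnapshot(node_cpu_utilization=0.43, throttled_periods=98, total_periods=100)

findings = triage(baseline, current, pod_use)
for finding in findings:
    print(finding)
```

With these inputs, the availability and node CPU checks stay silent, just as they did in the incident. The latency and throughput checks flag the service, and the pod-level throttle ratio explains why.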