On-call isn't supposed to be this hard


Kirti Rathore · May 29, 2026 · 6 min read

Bad Prometheus alerts tell an on-call engineer only that something is wrong, while good alerts connect the symptom to traces, logs, deploys, and the suspect commit.

That distinction sounds small until you're on-call and an alert storm appears.

You open one of the alerts and see:

[CRITICAL] CheckoutHighErrorRate - 7.3% 5xx in prod-eu-west-1

The alert is not wrong. Checkout is returning 500s. But it hasn't told you which host, container, or VM to investigate first.

let the wild hunt begin

The SRE or developer now has to do all the work.

If you know what you're doing, you first check the alert definition.

A basic Prometheus setup usually looks like this:

- alert: CheckoutHighErrorRate
  expr: |
    sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{service="checkout"}[5m]))
      > 0.05
  for: 10m
  labels:
    severity: critical
    team: payments
    service: checkout
  annotations:
    summary: "Checkout 5xx rate is above 5%"
    runbook_url: "https://runbooks.corp/payments/checkout-5xx"
    dashboard_url: "https://grafana.corp/d/checkout"

This gives you some important pieces of information:

The alert computes the share of checkout requests that returned a 5xx over a 5-minute window and fires when that share stays above 5% for 10 minutes.

The alert is owned by the Payments team.

There is a runbook you can start from.

The raw PromQL graph shows why the alert fired but doesn't give much more context.
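To make the alert expression concrete, here is a minimal Python sketch of the arithmetic behind it. The function name `error_ratio` and the request counts are made up for illustration. Prometheus works with per-second rates, but since both sides of the division use the same 5-minute window, the result is roughly the fraction of requests in that window that failed.

```python
def error_ratio(errors_5xx: int, total: int) -> float:
    """Fraction of requests in the window that returned a 5xx."""
    if total == 0:
        # Assumption for this sketch: no traffic counts as no errors.
        return 0.0
    return errors_5xx / total


THRESHOLD = 0.05  # the "> 0.05" in the alert expression

# Hypothetical counts for one 5-minute window.
ratio = error_ratio(errors_5xx=73, total=1000)
condition_met = ratio > THRESHOLD
print(f"{ratio:.1%} 5xx, condition met: {condition_met}")
```

The `for: 10m` clause adds one more requirement that this sketch leaves out: the condition has to keep holding for 10 minutes before the alert actually fires.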

But there is no reason to celebrate just yet.

The real work begins now.

Adjust the time window to within 5 minutes of the alert time.

Open Grafana and check if the dashboards have any extra information.

Open Loki and write a query like {service_name="checkout"} |= "500".

Open Tempo and filter traces by time. Guess which trace represents the incident.

Open your CD pipeline and search for any deploys just before the alert.
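The last step, looking for deploys shortly before the alert, is the kind of lookup people end up doing by eye. A rough sketch of it in Python follows. Everything here is hypothetical: the `deploys_before` helper, the 30-minute lookback, the record shape, and the timestamps. A real CD system would expose this through its own API.

```python
from datetime import datetime, timedelta


def deploys_before(alert_time, deploys, lookback=timedelta(minutes=30)):
    """Return deploys that finished within `lookback` before the alert, newest first."""
    window_start = alert_time - lookback
    recent = [d for d in deploys if window_start <= d["finished_at"] <= alert_time]
    return sorted(recent, key=lambda d: d["finished_at"], reverse=True)


# Made-up sample data for illustration only.
alert_time = datetime(2030, 1, 15, 14, 10)
deploys = [
    {"service": "checkout-api", "version": "2.4.0", "finished_at": datetime(2030, 1, 14, 9, 0)},
    {"service": "checkout-api", "version": "2.4.1", "finished_at": datetime(2030, 1, 15, 13, 52)},
]

suspects = deploys_before(alert_time, deploys)
print([d["version"] for d in suspects])
```

Doing this by hand for every alert is exactly the toil that deploy metadata on the alert is meant to remove.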

At some point, several possible hypotheses appear.

A big new feature in checkout-api@v2.4.1 looks fishy.

High CPU usage on 3 out of the 5 hosts that reported 5xx errors.

Suspicious I/O errors on all the investigated hosts.

Slow DB transactions.

Eventually the developer manages to reconstruct context across four tools, in about an hour if they know exactly what they're doing.

Meanwhile, there may be other fresh alerts to investigate.

good alerts tell you where to start looking

The same stack can behave very differently.

Not a different vendor. Not a more expensive alerting product.

The same stack, wired correctly to bubble up context.

Here is what it would look like for the Prometheus/Grafana/Tempo/Loki stack:

-> Prometheus exporter using OpenTelemetry SDK.

-> histograms correlated with trace spans.

-> Grafana exemplars enabled.

-> Tempo setup with trace-to-logs enabled.

-> deploy marker / service.version / commit SHA added as metadata with each alert.

The alert still starts with a metric. It should. Metrics are how you detect the symptom.

But the metric now carries a breadcrumb to a specific request.

Exemplars are the bridge from an aggregate bucket to a specific slow request.

Prometheus alerts do not naturally carry a trace_id. A histogram bucket is an aggregate. It is not a request.

Exemplars change that. A sampled measurement can attach the active trace_id to the bucket. Grafana can render that as a clickable diamond. Click it and Tempo opens the representative trace.
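A toy model may help show what "attaching a trace_id to a bucket" means. The sketch below is plain Python and is not the Prometheus client API. The `ToyHistogram` class and the bucket bounds are assumptions made for illustration, and unlike Prometheus's cumulative `le` buckets it keeps a separate count per bucket.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class ToyHistogram:
    """Toy latency histogram. Illustrative only, not the Prometheus client API."""
    bounds: list                                     # ascending upper bounds in seconds
    counts: list = field(default_factory=list)       # one count per bucket, plus overflow
    exemplars: dict = field(default_factory=dict)    # bucket index -> (trace_id, value)

    def __post_init__(self):
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is the +Inf bucket

    def observe(self, seconds, trace_id=None):
        # Index of the first bucket whose upper bound is >= the observed value.
        i = bisect.bisect_left(self.bounds, seconds)
        self.counts[i] += 1
        if trace_id is not None:
            # Keep the most recent sampled request for this bucket.
            self.exemplars[i] = (trace_id, seconds)


h = ToyHistogram(bounds=[0.1, 0.5, 1.0])
h.observe(0.08)                                                # a fast request, no exemplar
h.observe(1.21, trace_id="4bf92f3577b34da6a3ce929d0e0e4736")   # a slow, sampled request
print(h.counts, h.exemplars)
```

The counts alone only say that one request landed in the slowest bucket. The exemplar is what names the request, and that is the value a dashboard can turn into a link to the trace.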

The trace shows the slow span and the context attached to it: database statement, feature flag, user, and service version.

In the good version, the selected span says:

service: db-primary
operation: SELECT orders WHERE user_id=$1
duration: 1210ms
db.rows_affected: 1110482
feature_flag.new_checkout: true
service.version: 2.4.1

We see the slow database queries in the distributed trace.

Then Tempo's trace-to-logs link opens Loki for the exact trace.

Trace-to-logs only works if logs carry the same trace identifier.

The log line is not buried in a time-window query anymore:

slow query: seq scan on orders (1.1M rows), index not used trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7 service.version=2.4.1 commit=7a3f9c2
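One way to get the trace identifier into every log line is to stamp it on at the logging layer, so individual log calls don't have to remember it. Here is a minimal sketch using only Python's standard library. The `current_trace_id` context variable is a stand-in assumption: in a real service the ID would come from the active span of whatever tracing SDK is in use.

```python
import contextvars
import io
import logging

# Stand-in for the tracing SDK's notion of the current trace (an assumption).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


stream = io.StringIO()  # captured in memory here; a real handler would write to stdout or a file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(message)s trace_id=%(trace_id)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("slow query: seq scan on orders")

log_line = stream.getvalue().strip()
print(log_line)
```

With the ID present on every line, a log query can filter on the trace instead of guessing at a time window.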

Now the hypothesis is no longer vague.

checkout-api@v2.4.1 added the new order-history query path.

The user_id column on orders needs an index.

The bad path is gated by feature_flag.new_checkout=true.

Disable the flag or roll back 7a3f9c2.

the configuration is what makes the oncall experience fun

None of this comes automatically, whether you are using Prometheus + Grafana, Datadog, or New Relic.

The good path needs deliberate plumbing:

Page on symptoms: error rate, latency, traffic, saturation, or SLO burn.

Put team, service, severity, runbook_url,...
