How to curate observability data for AI agents

When we started building Multiplayer's debugging agent, we made the same mistake almost everyone makes. We gave our coding agent access to observability data and expected it to figure out what was relevant. It didn't. The agent called the wrong tools, chased the wrong signals, and produced fixes that looked plausible but failed in production.

We were using state-of-the-art models, but we were handing them raw observability data without any curation or filtering. We later realized that we were just routing them noise. What follows is what we learned about what you actually have to do with observability data before it's fit for an AI agent to act on.

The signal-to-noise problem

Observability data has one of the worst signal-to-noise ratios of any data type you could feed an AI agent. A single production issue might involve hundreds of spans across a dozen services, thousands of log lines, missing request and response payloads, redacted headers, clock-skewed timestamps, and events distributed across tools that have never been correlated with each other.

A human debugging this issue brings years of context: they know which services are noisy, which logs matter, which timestamps to trust, and roughly where in the stack the problem lives. They navigate the noise because they understand the system.

An agent sees everything with equal weight. Garbage spans get the same attention as the one span that actually shows the failure. Thousands of log lines get processed before the agent can ask a useful question. And because context windows are finite and expensive, you burn through your budget before you've even framed the problem correctly.

This is a data preparation problem. And it's one that has to be solved before the data reaches the agent, not by the agent itself.

What data curation actually means

Data curation for AI agents shouldn't be confused with summarization or compression, which is what most engineering teams end up doing.
Rather, it's the process of transforming raw observability data into a structured, scoped, context-rich package that an agent can reason about correctly. That means making a series of deliberate decisions: what to include, what to exclude, how to group related signals, and what additional context the agent needs to understand the problem. At Multiplayer, we do this in four stages before any data reaches a coding agent.

Stage one: group and correlate aggressively

The first thing we do with raw observability data is group related events and correlate them across service boundaries. A single bug will typically surface across many sessions, environments, and services. Without grouping, each occurrence looks like a separate issue. And without correlation, the agent can't see the causal chain that connects a user action on the frontend to a failure deep in the backend.

We correlate aggressively: user interactions, session metadata, network requests, backend traces, and log events get tied together into a single timeline before anything else happens. The agent needs to see that the click at 14:32:01 caused the cascade that showed up in the backend logs at 14:32:04. It can't infer that from timestamps alone (especially under any real load or clock skew). The correlation has to be built into the data structure before the agent sees it.

We also deduplicate at this stage. The same bug appearing across a hundred user sessions becomes one issue, not a hundred separate signals. This matters for both cost and quality. An agent acting on deduplicated, grouped data produces one PR for one issue. An agent acting on raw, ungrouped data produces dozens of PRs for the same issue, burns through tokens unnecessarily, or gets confused trying to reconcile conflicting signals from the same underlying failure.

Stage two: assess fixability before routing to the agent

Not every issue is worth routing to a coding agent, and not every issue is something a coding agent can fix.
Before anything reaches the coding agent, we run a fixability assessment through a dedicated agent. Is this a deterministic, reproducible failure with a clear root cause? Or is it an intermittent, environment-specific issue that requires human judgment to diagnose?

This matters for a few reasons. First, coding agents produce their worst outputs on problems they don't have enough context to solve correctly, which are often the hardest, most intermittent bugs. Routing those to a coding agent without human oversight wastes tokens and produces plausible-looking fixes that don't hold.

Second, fixability scoring lets you prioritize. High-fixability issues (clear root cause, deterministic reproduction, well-scoped impact) go to the coding agent immediately. Lower-fixability issues get flagged for human review with the curated context already attached. The goal is to keep humans in the loop where human judgment is actually needed, and route everything else through the automated fix...
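
To make the routing decision in stage two concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Multiplayer's implementation: the article runs the assessment through a dedicated agent, while this sketch swaps in a simple rule-based score over the three criteria named above. The `Issue` fields, `fixability_score`, `route`, and the default threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Hypothetical summary of one grouped, deduplicated issue."""
    issue_id: str
    deterministic_repro: bool   # can the failure be reproduced reliably?
    clear_root_cause: bool      # does the curated data point at one cause?
    well_scoped_impact: bool    # is the blast radius understood?


def fixability_score(issue: Issue) -> float:
    """Fraction of the three criteria that the issue satisfies (0.0 to 1.0)."""
    checks = [
        issue.deterministic_repro,
        issue.clear_root_cause,
        issue.well_scoped_impact,
    ]
    return sum(checks) / len(checks)


def route(issue: Issue, threshold: float = 1.0) -> str:
    """Send high-fixability issues to the coding agent, the rest to a human.

    The threshold is an assumption; with the default, all three criteria
    must hold before an issue is routed automatically.
    """
    if fixability_score(issue) >= threshold:
        return "coding_agent"
    return "human_review"


# Example: a clean, reproducible bug versus an intermittent one.
clean_bug = Issue("ISSUE-1", True, True, True)
flaky_bug = Issue("ISSUE-2", False, True, True)
print(route(clean_bug), route(flaky_bug))
```

A real assessment would likely weigh far more evidence than three booleans. The point of the sketch is only the shape of the decision: score first, then route, so that low-confidence issues never reach the coding agent by default.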

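Stage one's grouping and deduplication can be sketched in the same spirit. The example below assumes every event already carries a shared correlation key (a hypothetical `trace_id`) and that failures carry a stable error signature (a hypothetical `fingerprint`). Neither field name comes from the article, and how those keys are produced is outside the scope of this sketch. Events are joined on the explicit key rather than on timestamp proximity, since timestamps alone are not reliable under load or clock skew.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    trace_id: str          # hypothetical correlation key shared across services
    session_id: str        # user session the event belongs to
    source: str            # e.g. "frontend", "backend", "log"
    timestamp: float       # seconds; used only to order events within a trace
    message: str
    fingerprint: str = ""  # hypothetical error signature; empty if not a failure


def correlate(events):
    """Group events into one timeline per trace_id.

    Ordering inside a trace still uses timestamps here; a real system might
    prefer parent/child span links where they exist.
    """
    timelines = defaultdict(list)
    for event in events:
        timelines[event.trace_id].append(event)
    for timeline in timelines.values():
        timeline.sort(key=lambda e: e.timestamp)
    return dict(timelines)


def deduplicate(timelines):
    """Collapse timelines that share an error fingerprint into one issue."""
    issues = {}
    for timeline in timelines.values():
        fingerprint = next((e.fingerprint for e in timeline if e.fingerprint), None)
        if fingerprint is None:
            continue  # no failure recorded in this timeline
        issue = issues.setdefault(
            fingerprint,
            {"fingerprint": fingerprint, "example_timeline": timeline, "sessions": set()},
        )
        issue["sessions"].update(e.session_id for e in timeline)
    return issues


# Example: the same failure seen in two sessions becomes a single issue.
events = [
    Event("t1", "s1", "backend", 4.0, "500 from checkout", "checkout-null-cart"),
    Event("t1", "s1", "frontend", 1.0, "click: pay"),
    Event("t2", "s2", "frontend", 7.0, "click: pay"),
    Event("t2", "s2", "backend", 9.5, "500 from checkout", "checkout-null-cart"),
]
timelines = correlate(events)
issues = deduplicate(timelines)
```

Keeping one example timeline per issue, plus the set of affected sessions, gives the agent a single ordered causal chain to read while preserving a sense of how widespread the failure is.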