Treating Agent Reasoning as a Span

AI Agents Run To Completion - by Forest Mars

There is No Antimemetics Blog


AI Agents Run To Completion: Treating reasoning as infrastructure is collapsing the stack

Forest Mars Mar 09, 2026

The Agentic Web has two requirements that have to work simultaneously: trust and observability. Trust without observability is recklessness. Observability without trust is security theater. Most organizations are discovering they have neither, certainly not at scale, which is why agent pilots aren’t graduating to production.

The question CTOs are asking is: how do we give agents write access to production systems without creating a career-ending incident? The answer everyone seems to want is “better prompts” or “fine-tuning” or “guardrails.” But the answer that actually works is treating agent reasoning as infrastructure. How are we building this?

Reasoning as a Span

Production-ready infrastructure doesn’t just ask ‘what did the agent do,’ but ‘what was it thinking when it did it.’

Google Cloud automatically enabled OpenTelemetry ingestion endpoints for all projects on Wednesday (March 4). This matters because modern observability infrastructure treats telemetry as first-class code, with automated agents using this data to perform self-healing deployments. The shift is from humans debugging what agents did, to agents debugging themselves using the same telemetry infrastructure. So maybe it’s actually O11y 3.0.

Crucial to this is what your implicit model of ‘coding’ is, architecturally. Consider GitHub Agents, the somewhat anticipated product released in late January by what is presumptively a top-notch product engineering team, but which treats coding as a series of chat sessions rather than as a distributed systems problem, as Spotify’s Honk, for example, does. (Protip: you can’t solve a layer 2 problem with a layer 1 tool.)

OpenHands (formerly OpenDevin) hit v1.3.0 in February with full support for Agent Client Protocol (ACP), making it compatible with almost every modern IDE and CI/CD pipeline. The project has arguably become the most important open-source initiative in software engineering for 2026. Unlike closed agents where reasoning is opaque, you can see the “Thought Content” in OpenHands. The reasoning chain is visible, traceable, auditable. When an agent makes a decision, you can see why. When it fails, you can debug the reasoning.

This visible reasoning chain points to something bigger: when you pipe an agent’s chain into a modern traceability pipeline (Honeycomb, Chronosphere, Grafana Cloud), you treat AI reasoning like a first-class execution trace. Not as metadata or logs, but as spans in a distributed trace where you can see the decision chain that led to every action. You are no longer using system-level events as a proxy to infer an explanation. This is what production-ready agentic infrastructure looks like.

When an agent makes a decision at 2am that takes down a service, you don’t reconstruct what happened from logs. You have the full reasoning trace showing: what context it had, what options it considered, why it chose what it did, what it expected to happen. The same infrastructure you use to debug distributed systems now debugs distributed intelligence.

The architectural requirement is straightforward but most organizations haven’t built it: every agent action must generate a trace that captures both execution and reasoning. OpenHands’ Thought Content provides a reasoning chain that can be captured via the event stream and exported to OpenTelemetry as custom span attributes or logs. Without this, you’re deploying black boxes to production and hoping nothing breaks in ways you can’t explain to compliance. Which is to say, hoping nothing breaks at all.

Memory State Management for Agents
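Before getting into memory, here is one way the trace requirement from the previous section (every agent action generating a trace that captures both execution and reasoning) might look in code. This is a minimal sketch, not OpenHands’ event stream and not the OpenTelemetry SDK: the `Span` class is a stdlib-only stand-in, and `trace_agent_action` and the `agent.reasoning.*` attribute names are hypothetical, chosen only to mirror the four things the reasoning trace is supposed to show (context, options, rationale, expected outcome).

```python
from __future__ import annotations

import time
import uuid
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """Simplified stand-in for a tracing span; NOT the OpenTelemetry SDK type."""

    name: str
    trace_id: str
    parent_id: str | None = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: int | None = None
    attributes: dict[str, Any] = field(default_factory=dict)

    def end(self) -> None:
        self.end_ns = time.time_ns()


def trace_agent_action(
    action: str,
    context: str,
    options: list[str],
    rationale: str,
    expected: str,
    trace_id: str | None = None,
) -> list[Span]:
    """Emit a reasoning span plus a child execution span for one agent action."""
    trace_id = trace_id or uuid.uuid4().hex

    # Hypothetical attribute names, for illustration only.
    reasoning = Span(
        name="agent.reasoning",
        trace_id=trace_id,
        attributes={
            "agent.reasoning.context": context,
            "agent.reasoning.options": list(options),
            "agent.reasoning.rationale": rationale,
            "agent.reasoning.expected_outcome": expected,
        },
    )

    # The action is recorded as a child of the reasoning that produced it,
    # so the "why" sits directly above the "what" in the trace tree.
    execution = Span(
        name="agent.action",
        trace_id=trace_id,
        parent_id=reasoning.span_id,
        attributes={"agent.action.command": action},
    )
    execution.end()
    reasoning.end()
    return [reasoning, execution]


if __name__ == "__main__":
    for span in trace_agent_action(
        action="kubectl rollout undo deploy/api",
        context="p99 latency doubled after the last deploy",
        options=["rollback", "scale up", "wait"],
        rationale="regression correlates with the new image",
        expected="latency returns to baseline",
    ):
        print(span.name, span.parent_id, span.attributes)
```

The design choice worth noticing is the parent/child link: in a real pipeline the same shape would presumably map onto whatever span and attribute model your tracing backend uses, but the point is that the reasoning is a node in the trace, not a log line you correlate after the fact.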

Organizations treating agent memory as ‘just a bigger context window’ are solving the wrong problem.

Snowflake Cortex fully integrated AI functions into standard SQL in February, allowing real-time text-to-SQL and unstructured data analysis without leaving the warehouse. MongoDB’s Voyage 4 models set new standards for retrieval accuracy in RAG. But the real shift is episodic memory: platforms introducing ‘state management’ for agents directly on the data tier. This is the convergence of the agentic data stack.

Agents don’t just query databases. They maintain long-term memory of previous transactions and interactions stored directly in the data layer. When an agent needs to understand what happened last week or why a previous decision was made, that’s not in a prompt or a context window. It’s not even in a database per se, but committed to the agent’s knowledge graph: versioned, queryable, auditable, and, in principle, content-addressable.

Mastra’s Datasets (18 Feb) exemplify this: versioned test cases with native JSON schema validation and SCD-2 versioning (slowly changing dimension type 2, the data warehousing technique of using surrogate keys for immutable updates). What’s interesting here is that an ostensible application framework is taking on deployment...
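To make ‘versioned, queryable, auditable, and content-addressable’ a little more concrete, here is a rough sketch of SCD-2 style versioning applied to agent memory. It is an in-memory toy under stated assumptions, not Mastra’s API or any particular database: `EpisodicMemory`, `MemoryVersion`, the integer logical clock, and the SHA-256 content hash are all hypothetical names and choices made for illustration.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class MemoryVersion:
    surrogate_key: int          # unique per version, never reused
    natural_key: str            # the thing being remembered, e.g. "deploy:svc-a"
    payload: dict[str, Any]
    content_hash: str           # SHA-256 of the canonical JSON payload
    valid_from: int             # logical clock tick when this version became current
    valid_to: int | None = None  # None means "still current"


class EpisodicMemory:
    """Toy SCD-2 store: updates append a new version instead of overwriting."""

    def __init__(self) -> None:
        self._rows: list[MemoryVersion] = []
        self._clock = 0  # hypothetical logical clock standing in for timestamps

    def current(self, natural_key: str) -> MemoryVersion | None:
        for row in reversed(self._rows):
            if row.natural_key == natural_key and row.valid_to is None:
                return row
        return None

    def write(self, natural_key: str, payload: dict[str, Any]) -> MemoryVersion:
        canonical = json.dumps(payload, sort_keys=True).encode()
        digest = hashlib.sha256(canonical).hexdigest()

        previous = self.current(natural_key)
        if previous is not None and previous.content_hash == digest:
            return previous  # identical content: no new version needed

        self._clock += 1
        if previous is not None:
            # Closing the validity window is the only change made to an old row;
            # its payload is never modified.
            previous.valid_to = self._clock

        row = MemoryVersion(
            surrogate_key=len(self._rows) + 1,
            natural_key=natural_key,
            payload=dict(payload),
            content_hash=digest,
            valid_from=self._clock,
        )
        self._rows.append(row)
        return row

    def as_of(self, natural_key: str, tick: int) -> MemoryVersion | None:
        """Return the version that was current at the given clock tick."""
        for row in self._rows:
            if row.natural_key != natural_key or row.valid_from > tick:
                continue
            if row.valid_to is None or tick < row.valid_to:
                return row
        return None

    def history(self, natural_key: str) -> list[MemoryVersion]:
        return [r for r in self._rows if r.natural_key == natural_key]
```

With something shaped like this, ‘why was that decision made last week’ becomes an `as_of` lookup against the version that was current at the time, rather than a guess reconstructed from whatever is in the context window today.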
