How AI Agents Actually Work: An Architectural Deep Dive<br>An analysis of the patterns, infrastructure, and trade-offs behind the systems that have redefined what large language models can do<br>Executive Summary<br>The term “AI agent” has become one of the most overloaded in modern tech, but at its core it refers to a simple pattern: a large language model (LLM) connected to external tools and operating in a loop in which it reasons about what to do, calls a tool, observes the result, and repeats until the task is complete. This pattern, known as ReAct after the 2022 paper “ReAct: Synergizing Reasoning and Acting in Language Models,” has become the foundation of most production AI agents today.<br>What makes agents work well is not the model itself but the surrounding infrastructure: how context windows are managed across thousands of tool calls, how tools are designed for non-deterministic consumers, and how safety boundaries are enforced. A widely circulated claim has become the defining statistic in this space: Claude Code’s leaked source code reportedly showed that only about 1.6% of its codebase constitutes AI decision logic, with the remaining 98.4% being operational infrastructure [3]. This figure is disputed: critics argue that it misinterprets how the Liu et al. paper categorizes different kinds of code, and that the distinction between “AI logic” and “infrastructure” is itself an interpretive choice rather than a fact about the code.
Regardless of the exact percentage, the underlying intuition holds: production agent systems are dominated by operational engineering.<br>The architecture has evolved through several identifiable layers:<br>The ReAct loop (Thought → Action → Observation) interleaves reasoning traces with external actions so the model can induce, track, and update plans while interacting with real data sources.<br>Tool use connects the model to APIs, files, databases, and other systems. The key insight is that tools must be designed specifically for agents, which are non-deterministic consumers, rather than being thin wrappers around existing API endpoints.<br>Memory comes in two forms: short-term (in-context learning, bounded by the context window) and long-term (external stores, typically vector databases queried via Retrieval-Augmented Generation).<br>Planning and composition patterns (orchestrator-workers, evaluator-optimizer, parallelization) allow agents to handle complex multi-step tasks.<br>Multi-agent systems delegate subtasks to specialized workers, accepting sharply higher token costs in exchange for substantial capability gains on open-ended problems.<br>Observability (distributed tracing via OpenTelemetry GenAI semantic conventions, infinite-loop detection, cost attribution, and session replay) has emerged as a critical operational layer. Without it, debugging non-deterministic agent behavior is nearly impossible.<br>The most important finding from this research is that agent architecture has converged around a small set of well-understood patterns. The competition between framework vendors (LangChain, CrewAI, OpenAI’s SDKs, Anthropic’s Agent SDK) is largely about ergonomics.
Real engineering effort goes into context management, tool design, and reliability, areas where the best practitioners have accumulated significant domain knowledge.<br>A second important finding is that the gap between agent benchmarks and real-world performance is much wider than commonly assumed: 95% of enterprise AI pilots deliver zero measurable ROI [25], and roughly half of SWE-bench-passing PRs would not be merged by real maintainers [17]. The field’s primary bottleneck is now evaluation methodology, not model capability [21].<br>A third finding: the “agent winter” critique has empirical backing. Enterprise adoption has been slower and more cautious than early hype suggested. Gartner predicts that 40% of agentic AI projects will be scrapped by 2027, citing “rising costs, unclear business value, and integration complexity,” and PwC identifies integration complexity (67%), lack of monitoring (58%), and unclear escalation paths (52%) as the top causes of pilot failure.<br>1. Definitions: What Is an “Agent” and How Does It Differ from Other AI Systems?<br>The word “agent” has a long history in computer science. The classic definition from Russell and Norvig’s Artificial Intelligence: A Modern Approach describes an agent as anything that perceives its environment through sensors and acts upon that environment through actuators. This is a broad definition; a thermostat is technically an agent.<br>In the modern AI literature, the term has narrowed. Anthropic defines agents as “systems where LLMs dynamically direct their own processes and tool usage,” distinguishing them from workflows:...