Agent Judge: Solving Long-Context Evals for Production Agents — Judgment Labs<br>Announcing $32M in funding led by Lightspeed.View Announcement
Agent Judge: Solving Long-Context Evals for Production Agents<br>Why production agent evals need agentic judges that can search, verify, and adapt.<br>Rishi Gujjar, Andrew Li · May 27, 2026 · 8 min read
Moving Away From Simple LLM Judges
Most teams evaluate agent trajectories with a simple LLM judge: give the judge the user query, the final agent output, perhaps some metadata, and a rubric, then ask whether the agent behaved as intended.
As the industry moves toward long-horizon agents that autonomously perform tasks end-to-end, LLM judges fail to consistently produce accurate evals. For instance, a sales agent may research leads, update a CRM, send an email, and book a meeting before it returns a final message. Or a coding agent may edit dozens of files, update an AWS config, and open a GitHub PR.
In both cases, a basic LLM judge breaks down: it cannot fit the full agent trajectory into its context window, and it cannot verify stateful changes against source-of-truth systems such as Google Calendar, a CRM, AWS, or GitHub. As a result, the effectiveness of automated evals falls apart. Agent failures slip through undetected, customer dissatisfaction persists, and teams default back to manual review of agent trajectories.
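To make the context-window limit concrete, here is a minimal sketch of a naive judge that keeps only the most recent steps that fit a fixed budget. Everything in it is an assumption for illustration: the helper names `count_tokens` and `build_judge_context`, the word count standing in for real tokenization, the 500-token budget, and the synthetic trajectory.

```python
# Hypothetical sketch: a naive judge context that keeps only what fits a budget.
# Token counts are approximated by whitespace-separated words (an assumption).

def count_tokens(text: str) -> int:
    return len(text.split())


def build_judge_context(steps: list[str], budget: int) -> list[str]:
    """Keep the most recent steps that fit in the budget and drop the rest."""
    kept: list[str] = []
    used = 0
    for step in reversed(steps):
        cost = count_tokens(step)
        if used + cost > budget:
            break
        kept.append(step)
        used += cost
    kept.reverse()
    return kept


# Synthetic trajectory: 1,000 steps with one failed tool call early on.
steps = [f"step {i}: tool call ok" for i in range(1000)]
steps[120] = "step 120: crm.update_lead FAILED with 403"

visible = build_judge_context(steps, budget=500)
failure_visible = any("FAILED" in s for s in visible)
print(len(visible), failure_visible)  # 100 False
```

In this toy run the judge sees only the last 100 steps, so the failed call at step 120 never reaches it. Keeping the head of the trace instead of the tail would just move the blind spot.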
LLM judges break down on long-horizon agents for three reasons:
Long trajectories. Long-horizon agents can span hundreds of tool calls across databases, services, documents, and other systems. Coding agents like Codex and Claude Code can run for long horizons because they compact context as they work. That lets their trajectories extend into millions of tokens, far beyond what an LLM judge can fit into a single context window.
Figure: LLM judges can only read a small window of a long trajectory. Most of the trajectory falls outside what the judge can read. (Diagram labels: Input, Rest of trajectory, Output, LLM Context Window, What the Judge Misses.) Long-horizon agent trajectories can exceed what an LLM judge can hold in context. Pasting the whole trace into one prompt may fail outright; truncating or slicing it leaves important parts unread.
Stateful actions. Production agents do more than generate text. They query databases, call APIs, update records, send messages, and trigger workflows. A background sales agent might update the status of your leads, and the evaluator has to look into the CRM to verify that the change was reflected.
Figure: Only Agent Judge can reach the systems that hold the truth. LLM judges only see the trajectory. Agent Judge inspects the systems where production state lives. (Diagram labels: GitHub, AWS IAM, Secrets Manager, CloudWatch.) An LLM judge only sees the trajectory, not the corresponding environment, so stateful changes go unverified. Agent Judge queries the same systems the agent acted on and checks whether the action actually happened.
Changing behavior. Models, tools, and user workflows evolve as AI systems improve. An evaluation rubric that worked last month may go stale, miss new failure modes, over-penalize improved behavior, or keep looking for evidence in the wrong place. In production, the rubric has to evolve with the distribution of your queries and changes to your agents so the evaluator stays accurate and useful.
Figure: The rubric stays still. The agent does not. A fixed rubric defines a tolerance band. Production behavior drifts out of it. (Diagram labels: Week 1, Week 10, Rubric, Behavior.) Changes to the underlying models, tools, and user workflows change how the agent behaves. A fixed rubric keeps grading against old criteria, so the judge stops catching relevant new failure modes.<br>Evaluation is no longer a judgment of the final answer. It is an investigation of the entire trajectory. The evaluator has to inspect what the agent saw, did, changed, and relied on.
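Inspecting what the agent changed can be sketched as a comparison between what the trajectory claims and what the source-of-truth system reports. The example below is a minimal illustration under stated assumptions: `ClaimedAction`, `verify_action`, and the in-memory `FAKE_CRM` dictionary are hypothetical stand-ins, not a real CRM client or Agent Judge's API.

```python
from dataclasses import dataclass


@dataclass
class ClaimedAction:
    """What the trajectory says the agent did (hypothetical shape)."""
    lead_id: str
    field: str
    expected_value: str


# In-memory stand-in for a source-of-truth system such as a CRM.
FAKE_CRM = {
    "lead-1": {"status": "qualified"},
    "lead-2": {"status": "new"},  # the claimed update never landed
}


def verify_action(claim: ClaimedAction, crm: dict[str, dict[str, str]]) -> bool:
    """Return True only if the environment state matches the claimed change."""
    record = crm.get(claim.lead_id)
    if record is None:
        return False
    return record.get(claim.field) == claim.expected_value


claims = [
    ClaimedAction("lead-1", "status", "qualified"),
    ClaimedAction("lead-2", "status", "qualified"),
]
results = {c.lead_id: verify_action(c, FAKE_CRM) for c in claims}
print(results)  # {'lead-1': True, 'lead-2': False}
```

A judge that reads only the trajectory would accept both claims. Only the lookup against the stored state shows that the second update did not take effect.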
Agent Judge
We designed Agent Judge as an agentic evaluation harness to handle these three failure modes through three distinct capabilities: Search, Verification, and Adaptation.
Search: handles long trajectories by making them navigable, so buried evidence can be found without manual trace review.
Verification: handles stateful actions by checking tool evidence and environment state, so the eval checks the effect of agent actions.
Adaptation: handles dynamic human and agent behavior by comparing evaluations against human feedback and production signals across many production trajectories, so rubrics can evolve as the agent, tools, and product change.
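As one illustration of the adaptation signal, the sketch below compares judge verdicts with human labels on the same trajectories and flags the rubric for review when agreement drops. The function names, the 0.8 threshold, and the sample labels are assumptions made for this example, not values taken from Agent Judge.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of trajectories where the judge and the human reviewer agree."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


def rubric_is_stale(judge: list[bool], human: list[bool],
                    min_agreement: float = 0.8) -> bool:
    """Flag the rubric when agreement falls below an assumed threshold."""
    return agreement_rate(judge, human) < min_agreement


# Illustrative pass/fail labels only; not real evaluation data.
week1_judge = [True, True, False, True, False]
week1_human = [True, True, False, True, False]
week10_judge = [True, True, True, True, False]
week10_human = [False, True, False, True, True]

print(agreement_rate(week1_judge, week1_human))    # 1.0
print(agreement_rate(week10_judge, week10_human))  # 0.4
print(rubric_is_stale(week10_judge, week10_human)) # True
```

A falling agreement rate does not say what changed, only that the rubric and the reviewers have diverged and the criteria deserve a second look.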
Agent Judge runs as a multi-agent system: reader agents inspect targeted evidence, spawned worker agents split the search or verification work, and forked agents pursue new questions raised by the first pass.
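The fan-out idea can be pictured with a short sketch. This is not Agent Judge's implementation: plain keyword-matching functions stand in for reader and worker agents, and `chunk`, `worker_search`, `search_trajectory`, the chunk size, and the synthetic trajectory are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(steps: list[str], size: int) -> list[tuple[int, list[str]]]:
    """Split the trajectory into (start_offset, steps) chunks."""
    return [(i, steps[i:i + size]) for i in range(0, len(steps), size)]


def worker_search(offset: int, steps: list[str], keyword: str) -> list[int]:
    """Stand-in for a worker agent: return global indices of matching steps."""
    return [offset + i for i, s in enumerate(steps) if keyword in s]


def search_trajectory(steps: list[str], keyword: str,
                      chunk_size: int = 100) -> list[int]:
    """Fan the search out over chunks, then merge the hits in order."""
    chunks = chunk(steps, chunk_size)
    with ThreadPoolExecutor(max_workers=4) as pool:
        parts = pool.map(lambda c: worker_search(c[0], c[1], keyword), chunks)
        hits = [idx for part in parts for idx in part]
    return sorted(hits)


# Synthetic trajectory with two failures buried far apart.
steps = [f"step {i}: tool call ok" for i in range(1000)]
steps[120] = "step 120: crm.update_lead FAILED with 403"
steps[873] = "step 873: retry crm.update_lead FAILED with 403"

print(search_trajectory(steps, "FAILED"))  # [120, 873]
```

Each worker handles a chunk small enough to read in full, so no single reader has to hold the whole trajectory.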
Search
In long trajectories, failure modes are subtle and rarely live in one place. Mistakes can originate from an early...