Even (very) noisy LLM evaluators are useful for improving AI agents


TensorZero · May 12, 2026 · Alan Mishler

Summary

LLM evaluators are often noisy and weakly correlated with real-world outcomes.

Noisy evaluators have limited value for production decisions that hinge on judging a single output (e.g. guardrails).

However, even (very) noisy evaluators can reliably tell you which agent is better on average, meaning they can still help you pick the best variant to deploy and improve it over time.

It’s surprisingly hard to develop reliable LLM evaluators: they’re often noisy and poorly correlated with the metrics or outcomes practitioners actually care about. Sometimes the target is directly measurable but evaluators still disagree with experts (e.g. on correctness or faithfulness to a source document). Other times the target is only accessible through a proxy (e.g. whether code that passes tests satisfies user needs). And sometimes the target is hard to observe at all (e.g. whether a customer was actually happy with an interaction).

Why is it so hard to develop reliable LLM evaluators?

Rule-based and classical NLP metrics are often brittle and miss the semantic dimensions that matter [1, 2]. Learned reward models are vulnerable to distribution shift [3] and reward hacking [4]. Studies of LLM-as-a-judge setups have repeatedly documented systematic biases and limitations: judges are heavily swayed by surface-level style [5], prefer longer responses to shorter ones of similar quality [6], are inconsistent across repeated evaluations and minor prompt variations [7], often align poorly with human judgments [8], and may correlate weakly with the downstream outcomes they’re meant to predict [9].

An evaluator’s quality can be measured at two granularities:

Output-level correlation measures how well its score on individual outputs matches real-world outcomes. It governs production workflows (e.g. guardrails), where decisions hinge on individual outputs and noisy evaluators are unreliable. We’ll call an evaluator noisy with respect to a metric or outcome of interest if its output-level correlation is low.

Agent-level correlation measures how well its average over many outputs matches an agent’s real-world quality. It governs offline variant selection (e.g. picking the best prompt or model), and, unlike output-level correlation, it generally climbs with sample size as per-output noise averages out.

Even very noisy evaluators can be reliable for offline selection: enough to ship better agents today and keep improving them over time.
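The gap between the two granularities can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not from the article: agent qualities are drawn uniformly from a hypothetical range, true scores and evaluator noise are Gaussian, and the evaluator is modeled as the true score plus independent noise. The function names `pearson` and `simulate` are illustrative.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_agents=20, n_outputs=500, noise_sd=1.0, seed=0):
    """Return (output_level_corr, agent_level_corr) for a synthetic evaluator."""
    rng = random.Random(seed)
    true_scores, eval_scores = [], []
    agent_true_means, agent_eval_means = [], []
    for _ in range(n_agents):
        # Hypothetical true quality of this agent (assumed range).
        quality = rng.uniform(0.2, 0.8)
        # True scores vary around the agent's quality from output to output.
        t = [quality + rng.gauss(0.0, 0.15) for _ in range(n_outputs)]
        # Assumed evaluator model: true score plus independent zero-mean noise.
        e = [x + rng.gauss(0.0, noise_sd) for x in t]
        true_scores += t
        eval_scores += e
        agent_true_means.append(statistics.fmean(t))
        agent_eval_means.append(statistics.fmean(e))
    output_corr = pearson(true_scores, eval_scores)
    agent_corr = pearson(agent_true_means, agent_eval_means)
    return output_corr, agent_corr


output_corr, agent_corr = simulate()
print(f"output-level correlation: {output_corr:.2f}")
print(f"agent-level correlation:  {agent_corr:.2f}")
```

With these settings the evaluator noise is several times larger than the spread of true scores, so the output-level correlation comes out low, while averaging over 500 outputs per agent should leave the agent-level correlation high. Lowering `n_outputs` shrinks that gap.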

Why noisy evaluators can still rank agents

The key insight is that even a very noisy evaluator can yield scores that are higher on average for agents that truly are higher quality: the noise washes out over many samples.

To formalize this, suppose we have two agents we want to compare, $A$ and $B$. Let $\mu_A$ and $\mu_B$ represent the mean true scores for $A$ and $B$ in the problem setting of interest, where true score refers to the thing we’d ideally want to measure, like how well the agent handled a customer’s query or whether it produced runnable code. Suppose that higher scores are better. Then we’d say that $A$ is better than $B$ if $\mu_A > \mu_B$.

Now suppose we have an evaluator whose scores can be regarded as noisy versions of the true scores. Here are three hypothetical samples of true scores and evaluator scores for increasingly noisy evaluators:

[Figure: three scatterplot panels titled "Slightly noisy", "Moderately noisy", and "Very noisy"; x-axis: true score, y-axis: evaluator score (0 to 1).]

Hypothetical samples of true scores (x-axis) and evaluator scores (y-axis) for three evaluators with increasing noise. The dashed line marks y = x (a perfect evaluator). The leftmost evaluator is accurate enough to judge individual outputs in production. The rightmost isn’t: its verdict on any single output is too noisy to trust.

However, if we’re using an evaluator offline to choose between $A$ and $B$, then we don’t need every individual value to be accurate. We just need the evaluator to tell us which agent is better overall. All three evaluators will do that, given sufficiently large evaluation samples.
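One way to see why sample size matters, as a sketch under assumptions the text does not state explicitly: suppose each evaluator score is the true score plus zero-mean noise, and that $n$ outputs are sampled independently for each agent. Write $\widehat{\mu}_A$ and $\widehat{\mu}_B$ for the average evaluator scores, and $\sigma_A^2$, $\sigma_B^2$ for the per-output variances of the evaluator scores (which include both true-score spread and evaluator noise). Then

$$
\mathbb{E}\left[\widehat{\mu}_A - \widehat{\mu}_B\right] = \mu_A - \mu_B,
\qquad
\operatorname{Var}\left(\widehat{\mu}_A - \widehat{\mu}_B\right) = \frac{\sigma_A^2 + \sigma_B^2}{n},
$$

and, by the central limit theorem,

$$
\Pr\left(\widehat{\mu}_A > \widehat{\mu}_B\right) \approx \Phi\left(\frac{(\mu_A - \mu_B)\sqrt{n}}{\sqrt{\sigma_A^2 + \sigma_B^2}}\right),
$$

where $\Phi$ is the standard normal CDF. A noisier evaluator inflates $\sigma_A^2 + \sigma_B^2$, but for any fixed gap $\mu_A > \mu_B$ the probability still tends to 1 as $n$ grows. The argument relies on the noise being zero-mean for both agents; an evaluator that is biased toward one agent would not be covered by it.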

Suppose Agent $A$ has true-score mean $\mu_A = 0.6$ and Agent $B$ has $\mu_B = 0.3$, so $A$ is the better agent. Below are the same scatterplots as in the figure above, but with each output now colored by which agent it came from. Let $\widehat{\mu}_A$ and $\widehat{\mu}_B$ be the average evaluator scores for each agent, shown as horizontal dashed lines on each plot. In all three initial samples, $\widehat{\mu}_A > \widehat{\mu}_B$, meaning the evaluator correctly leads us to choose the better agent.
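To get a feel for how often a noisy evaluator picks the right agent, the comparison can be simulated many times. The sketch below keeps the means 0.6 and 0.3 from the text; everything else is an assumption made for illustration: Gaussian true scores with standard deviation 0.15, additive Gaussian evaluator noise, and three hypothetical noise levels that stand in for the three panels (they are not the values behind the article's figures).

```python
import random
import statistics

MU_A, MU_B = 0.6, 0.3  # true-score means from the text


def mean_eval_score(rng, true_mean, n, noise_sd, true_sd=0.15):
    """Average evaluator score over n outputs (evaluator = true score + noise)."""
    return statistics.fmean(
        rng.gauss(true_mean, true_sd) + rng.gauss(0.0, noise_sd)
        for _ in range(n)
    )


def prob_correct_ranking(n, noise_sd, trials=1000, seed=0):
    """Fraction of simulated evaluations in which A's average beats B's."""
    rng = random.Random(seed)
    wins = sum(
        mean_eval_score(rng, MU_A, n, noise_sd)
        > mean_eval_score(rng, MU_B, n, noise_sd)
        for _ in range(trials)
    )
    return wins / trials


# Hypothetical noise standard deviations, one per panel (assumed values).
NOISE_LEVELS = {"slightly noisy": 0.05, "moderately noisy": 0.3, "very noisy": 1.0}

results = {
    (label, n): prob_correct_ranking(n, sd)
    for label, sd in NOISE_LEVELS.items()
    for n in (10, 50, 250)
}
for (label, n), p in results.items():
    print(f"{label:>16}  n={n:<4} P(pick A) ~ {p:.3f}")
```

Under these assumptions, the slightly noisy evaluator should pick Agent A almost every time even with 10 outputs per agent, while the very noisy one needs a few hundred outputs per agent to become similarly reliable.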

[Figure: the same three panels ("Slightly noisy", "Moderately noisy", "Very noisy"), y-axis: evaluator score (0 to 1); legend: Agent A, Agent B, $\widehat{\mu}_A$, $\widehat{\mu}_B$.]

Samples drawn from the true-score distributions of Figure 2 (Agent A with mean 0.6, Agent B with mean 0.3; 30...
