Self-Improving Reward Models

[Figure: meta-reward: reward modeling as harness optimization. An agent trajectory enters the evaluator harness, which is optimized by meta-reward and wraps a frozen LLM judge to produce a reward score. Harness components: trace view (evidence shown to the judge), system prompt (rubric + policy checks), control logic (score procedure + validation), and scoring interface (structured output + calibration).]

We propose meta-reward: a method for improving agent reward models by optimizing the evaluator harness around a fixed LLM judge.

Agent task performance is only as good as the reward signal it optimizes against. Traditionally, reward models are trained to score model outputs using preferences, ratings, or verifier labels. But agents are harder to evaluate: an effective reward model needs to judge an agent's full trajectory, not just its final output.

LLM judges provide a scalable way to score those trajectories. They can read full agent traces and convert trajectory-level behavior into a reward signal. A raw LLM judge, however, is still an underspecified reward model. To become reliable, it needs a surrounding system that grounds the reward signal: what evidence the judge sees, what rubrics it applies, and what procedure turns its judgment into a score. We call this surrounding system the evaluator harness and optimize it directly.

On τ³-bench airline, optimizing the harness around a fixed Haiku 4.5 judge raises held-out agreement from 52.8% to 78.2% and improves natural best-of-N trajectory selection by up to +30.2 points.

Pointwise reward agreement on held-out agent rollouts

τ³-bench airline (Haiku 4.5 judge)
  Baseline harness (default evaluation procedure): 52.8%
  Tuned harness (optimized harness with policy checks and structured scoring): 78.2% (+25.4 pts)

Plan-RewardBench (Opus 4.6 judge)
  Baseline harness (default evaluation procedure): 60.5%
  Tuned harness (optimized harness with task-specific rubric): 72.4% (+11.9 pts)

Motivation

The core challenge in agent post-training is defining a reward signal that captures the behavior we want the agent to learn. In domains with verifiable outcomes, this is relatively clean. Math solutions can be checked deterministically and code can be evaluated with executable tests. In cases where automatic verification isn't available, reward signals are often constructed from human judgment.

For long-horizon agent tasks, reward specification is harder because the reward must judge the full trajectory, not just the final response. A customer support, research, or workflow agent is evaluated by what information it gathered, which tools it called, what policy it applied, when it changed external state, and when it chose not to act.

Without accurate trajectory-level supervision, we risk rewarding the right outcome for the wrong reasons. For example, an agent might reach the correct final state through a lucky guess, unnecessary tool use, or an unauthorized action. We observed this in τ³-airline as an action bias: the untuned judge often over-rewarded visible state-changing actions, like cancellations, compensation, and booking changes, even when policy required restraint.

Human annotation can provide this trajectory-level supervision, but labeling full agent traces is slow and expensive to scale. LLM judges offer a more scalable approximation. They can read agent traces, evaluate behavior against task criteria, and turn that judgment into a reward signal. In our previous meta-agent work, we used LLM judges to score unlabeled agent traces during harness optimization.

[Embedded post: Essam Sleiman (@essamsleiman), Apr 6. "meta-agent: continual learning for agents. We built meta-agent: an open-source library that automatically and continuously improves agent harnesses from production traces. Point it at an existing agent, a stream of unlabeled production traces,…"]

But a judge call is not yet a reward procedure.
Given a long trace and a rubric, the judge still has to infer what evidence matters, which constraints to prioritize, how to handle conflicting signals, and how to turn its reasoning into a score. Those choices determine what behavior gets rewarded.

We call the system that specifies these choices the evaluator harness. It defines the trace view, policy context, checks, rubric, decision process, and scoring logic around the judge.

meta-reward optimizes the evaluator harness directly. Using a small set of trusted trajectory preferences from human annotation or task-specific labels, it tunes the evaluation procedure so the judge's scores better align with trusted preferences and generalize to unseen trajectories.

How it works

meta-reward keeps the judge model parameters fixed and optimizes the evaluator harness around it: the trace view, policy context, checks, rubric, decision process, and scoring logic.

We start with a small set of trusted trajectory preferences. Each example contains two agent trajectories and a label for which trajectory should receive higher reward. The evaluator scores each trajectory 0–100 independently, then our system predicts the preference by choosing...
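The scoring-then-preference step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the meta-reward implementation: the names `Scorer`, `PreferencePair`, `predict_preference`, `agreement`, and `toy_scorer` are all hypothetical, the prediction rule (pick the higher-scoring trajectory, ties going to the first) is an assumption, and the keyword-based scorer is only a stand-in for a frozen LLM judge wrapped in an evaluator harness.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical type: maps one trajectory (plain text here) to a 0-100 score.
# In a real system this would be the frozen judge plus its evaluator harness.
Scorer = Callable[[str], float]


@dataclass(frozen=True)
class PreferencePair:
    """One trusted example: two trajectories and which one should score higher."""
    trajectory_a: str
    trajectory_b: str
    preferred: str  # "a" or "b"


def predict_preference(score: Scorer, pair: PreferencePair) -> str:
    """Score each trajectory independently, then pick the higher score.

    Assumption: ties go to "a". The source does not specify tie handling.
    """
    score_a = score(pair.trajectory_a)
    score_b = score(pair.trajectory_b)
    return "a" if score_a >= score_b else "b"


def agreement(score: Scorer, pairs: Sequence[PreferencePair]) -> float:
    """Fraction of pairs where the predicted preference matches the label."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(predict_preference(score, p) == p.preferred for p in pairs)
    return correct / len(pairs)


def toy_scorer(trajectory: str) -> float:
    """Toy stand-in for a judge: penalize traces flagged as unauthorized."""
    return 20.0 if "unauthorized" in trajectory else 80.0


pairs = [
    PreferencePair("checked policy, declined refund", "unauthorized refund issued", "a"),
    PreferencePair("unauthorized cancellation", "verified eligibility, cancelled", "b"),
    # The toy scorer cannot separate these two, so the tie rule picks "a" (wrong).
    PreferencePair("booked without asking", "asked for confirmation", "b"),
]

print(f"agreement: {agreement(toy_scorer, pairs):.3f}")
```

In this framing, harness optimization would amount to searching over versions of the scorer while the judge stays fixed, keeping the version whose agreement on held-out pairs is highest. Whether the reported agreement numbers are computed exactly this way is not stated in the text.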
