Inverse Rubric Optimization: A testbed for agent science | Fulcrum
“It is important to draw wisdom from many different places. If you take it from only one place, it becomes rigid and stale.”
— Uncle Iroh
At Fulcrum Research, we study the performance and behavior of long-horizon agents. Although each task setting has its own specific structure, we believe it’s possible to find general principles of agent performance that hold across settings and together contribute to a nascent agent science.
In this post, we explain why finding suitable settings for agent science is difficult and propose inverse rubric optimization (IRO) settings, in which an agent must optimize against the preferences of a black-box judge to which it has a variable amount of access. We observe that these tasks induce rich behavior and smooth scaling. We find that frontier models iterate effectively and improve with more judge access, but by default do not make full use of the resources provided to them. Notably, Fable 5 outperforms all other models at smaller label budgets, but does not improve at the largest budget and plateaus around the level of Opus 4.6. We open-source our code here.
Testbeds for agent science
Studying the behavior of agents is challenging due to the variance and cost of long-horizon tasks. Trajectory-level variance is often high because of the many non-deterministic choices made in a run, which makes it hard to estimate the impact of a given method. This variance tends to be highest in tasks with large action spaces, yet those are exactly the tasks that induce the complex behaviors we intend to study and intervene on in our experiments.
The challenge is then to find settings that require general kinds of capability and benefit from a broad range of strategies, like resource utilization, exploration, hypothesis testing, etc., while being smooth enough for research.
To remedy this, we look at toy settings that remain challenging, rich and smooth.
Inverse rubric optimization
In an IRO task, the agent being evaluated aims to learn the preferences of a black-box judge model, parametrized by a judging rubric. The agent submits a policy for generation (e.g. a prompt or a scaffold), which is then used to generate domain samples that the judge evaluates. The agent has to explore and learn the judge’s preferences by studying the scores and submitting new attempts.
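[Figure 1 diagram: the optimizer agent submits a policy π, which maps each input to an output; the judge (hidden rubric) returns a score to the optimizer, spending 1 label per scored output out of a budget B. The final policy π* is submitted and evaluated held-out.]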
Fig. 1: An inverse rubric optimization task. The optimizer agent iteratively submits a policy (e.g. a generation prompt) that maps task inputs to outputs; a black-box judge scores each output against a hidden rubric, spending one label per score. The agent finally submits its best policy, which is evaluated on held-out inputs.
In IRO tasks we vary the budget of judge labels that the optimizer can sample. How the optimizer interacts with and spends that resource (whether it tests hypotheses, whether it properly accounts for sampling noise, etc.) is the main interest of the setting.
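To make the sampling-noise point concrete, here is a minimal sketch, assuming each judge label is an independent noisy score. The helper names, the two-standard-error threshold, and the example scores are illustrative assumptions and are not taken from the released code.

```python
import math
import statistics


def mean_and_stderr(scores):
    """Return the sample mean and its standard error for a batch of judge scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate noise")
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(n)
    return mean, stderr


def looks_better(scores_a, scores_b, z=2.0):
    """Heuristic: is B's mean above A's by more than z combined standard errors?"""
    mean_a, se_a = mean_and_stderr(scores_a)
    mean_b, se_b = mean_and_stderr(scores_b)
    return (mean_b - mean_a) > z * math.sqrt(se_a**2 + se_b**2)


# Hypothetical judge scores for two candidate prompts (illustrative values only).
prompt_a = [4.0, 6.0, 5.0, 7.0, 3.0]
prompt_b = [5.0, 7.0, 6.0, 8.0, 4.0]
print(looks_better(prompt_a, prompt_b))  # False
```

With five labels per prompt, a one-point gap in mean score is within the noise here, so an optimizer that treats it as a real improvement would be acting on weak evidence. Spending more labels per comparison shrinks the standard error, at the cost of fewer comparisons within the budget.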
In this post, we study a specific set of IRO tasks defined by black-box LLM poetry judges. We generated a set of poem topics, as well as several poetry rubrics that grade poems along stylistic and textual features, guided by the styles of different poets. The optimizer agent then iterates on a prompt that a generator model uses to generate poems, which are scored by the judge. Its goal is to hill-climb the judge’s mean score on the generated poems. Each of our rubrics grades a set of features additively with partial credit, so that partial stylistic discoveries move the score (see Appendix A.1 for the full rubric of our Milton judge). 1
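As a toy illustration of additive grading with partial credit, the sketch below scores a poem against two made-up features. The judges in this post are LLM judges; this deterministic stand-in, its feature names, and its point values are hypothetical and only show why a partial discovery moves the score.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Feature:
    name: str
    max_points: float
    # Returns a fraction in [0, 1]: how fully the poem exhibits the feature.
    credit: Callable[[str], float]


def line_count_credit(poem: str, target: int = 14) -> float:
    """Partial credit that falls off linearly as the line count moves away from target."""
    lines = [ln for ln in poem.splitlines() if ln.strip()]
    return max(0.0, 1.0 - abs(len(lines) - target) / target)


def keyword_credit(poem: str, words=("thee", "thou", "thy")) -> float:
    """Fraction of the listed words that appear in the poem (crude substring match)."""
    text = poem.lower()
    return sum(w in text for w in words) / len(words)


# Hypothetical two-feature rubric, not one of the rubrics used in the post.
TOY_RUBRIC = [
    Feature("sonnet_length", 5.0, line_count_credit),
    Feature("archaic_diction", 5.0, keyword_credit),
]


def score(poem: str, rubric=TOY_RUBRIC) -> float:
    """Additive score: each feature contributes max_points times its partial credit."""
    return sum(f.max_points * f.credit(poem) for f in rubric)


half_way = "\n".join(["I sing of thee"] * 7)  # 7 of 14 lines, 1 of 3 words
print(round(score(half_way), 3))
```

Because each feature contributes independently, a prompt that gets only the line count roughly right already earns some points, which gives the optimizer a gradient to follow.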
In the poetry setting, the optimizer agent is given a submit_train_batch(generation_prompt, num_labels) tool. The tool takes a set of guidelines for poem generation, generates num_labels poems on train-set topics, and returns the scored poems to the optimizer. Each run has a label budget; the optimizer can call the tool as many times as it wants until the budget has been consumed, and at some point it submits a final prompt that is evaluated on the eval set.
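The budget accounting described above can be sketched as follows. Only the tool name and its two parameters come from the post; the class, the exception, the random topic sampling, the return format, and the behavior on an over-budget request are assumptions made for illustration, with toy callables standing in for the generator model and the judge.

```python
import random


class BudgetExceeded(Exception):
    """Raised when a call asks for more labels than remain in the budget."""


class ToyIROEnvironment:
    """Hypothetical stand-in for the poetry setting that tracks the label budget."""

    def __init__(self, train_topics, generate, judge, label_budget, seed=0):
        self.train_topics = list(train_topics)
        self.generate = generate  # (generation_prompt, topic) -> poem
        self.judge = judge        # poem -> score
        self.remaining = label_budget
        self.rng = random.Random(seed)

    def submit_train_batch(self, generation_prompt, num_labels):
        """Generate and score num_labels poems, spending one label per scored poem."""
        if num_labels <= 0:
            raise ValueError("num_labels must be positive")
        if num_labels > self.remaining:
            raise BudgetExceeded(f"requested {num_labels}, only {self.remaining} left")
        self.remaining -= num_labels
        results = []
        for _ in range(num_labels):
            topic = self.rng.choice(self.train_topics)  # assumed sampling scheme
            poem = self.generate(generation_prompt, topic)
            results.append({"topic": topic, "poem": poem, "score": self.judge(poem)})
        return results


# Toy generator and judge, for illustration only.
env = ToyIROEnvironment(
    train_topics=["autumn", "the sea"],
    generate=lambda prompt, topic: f"{prompt}: a poem about {topic}",
    judge=lambda poem: float(len(poem.split())),
    label_budget=10,
)
batch = env.submit_train_batch("Write in blank verse", num_labels=4)
print(len(batch), env.remaining)  # 4 6
```

The trade-off the optimizer faces is visible in the interface: a few large batches give low-noise estimates for a small number of prompts, while many small batches test more hypotheses with noisier feedback.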
We observed that different judges define different optimization landscapes, in terms of both the range of scores and the difficulty of the learning problem they pose to the optimizer (see Appendix A.3 for per-judge learning curves). To account for the difference in scale, we normalize by the gap between the performance of a policy that sees the hidden rubric and that of a policy that has no guidance. This gap provides a natural scale against which to benchmark optimizer progress, although the rubric-visible score is not necessarily an upper bound 2. We define the normalized score of the original mean judge score S as follows:
\[\hat{S} = \frac{S -...