Agents are under-elicited: A case study in optimization tasks | Fulcrum
“Knowing is not enough; we must apply. Willing is not enough; we must do.”
— Johann Wolfgang von Goethe
At Fulcrum Research, we study the performance and behavior of long-horizon agents. In our previous post, we introduced inverse rubric optimization (IRO): tasks where an agent must learn the preferences of a black-box judge under a label budget. IRO tasks are LLM optimization tasks, in which an agent iteratively optimizes a metric.
In this post, we study which general prompt and scaffold interventions improve performance in these LLM optimization settings. We show that our methods roughly double performance by changing both how much of their resources models use and how effectively they use each unit of resource.
This case study suggests agents are under-elicited by default, and simple methods can exploit this to yield substantial gains.
[Figure 1, top panel: "Elicitation ~doubles score at every budget". x-axis: label budget (100, 1k, 10k); y-axis: eval score (0.00 to 1.00); series: elicited, baseline.]
[Figure 1, bottom panel: "Elicitation makes agents use more resources more efficiently". x-axis: labels used (log scale, 100 to 10k); y-axis: latest train score (0.00 to 1.00). Baseline: mean finish 2.6k, train 0.46. Elicited: mean finish 9.8k, train 0.71. Annotated eval scores: 0.47 and 0.85. Annotated gaps: efficiency gap, propensity gap.]
Fig. 1: Elicitation roughly doubles eval score at every resource budget (top). At budget 10k (bottom), the elicited run, using handoff and prompting, climbs more steeply per label and runs far longer before stopping than the baseline. The resource here is labels from the oracle judge model whose preferences are being learned, and each curve stops at the mean finish point of that method's runs.
Decomposing performance in LLM optimization
In an LLM optimization trajectory, the agent iteratively makes progress by submitting attempts and reasoning about the results. The environment gives it a feedback metric it can call to check the quality of its work, for example a benchmark that measures the speed of its code.
We call the feedback metric available during the trajectory the train metric, as opposed to the eval score, which is the ground-truth score of the final submission. In IRO, the train metric is the judge's scores on a batch of samples from the train set; the agent chooses the batch size for each submission. The train metric is thus a noisy proxy for the eval score of the final submission.
LLM optimization naturally lets us study how well models use resources. Here, the resource is how much of the train metric the model is allowed to use, i.e., its label budget.
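To make the "noisy proxy" point concrete, here is a minimal simulation sketch. It is not the IRO environment: the true mean score (0.6), the Gaussian noise model, and the noise level (0.3) are all hypothetical choices made for illustration. It shows that a train metric averaged over a small batch varies much more from submission to submission than one averaged over a large batch, which is the trade-off the agent faces when it picks a batch size.

```python
import random
import statistics


def train_metric(true_mean, noise_sd, batch_size, rng):
    """Mean of `batch_size` noisy judge scores around a hypothetical true mean."""
    scores = [rng.gauss(true_mean, noise_sd) for _ in range(batch_size)]
    return statistics.fmean(scores)


def spread_of_train_metric(batch_size, trials=2000, seed=0):
    """Standard deviation of the train metric across repeated submissions."""
    rng = random.Random(seed)
    # 0.6 and 0.3 are made-up values, not measurements from the post.
    estimates = [train_metric(0.6, 0.3, batch_size, rng) for _ in range(trials)]
    return statistics.stdev(estimates)


small = spread_of_train_metric(batch_size=4)
large = spread_of_train_metric(batch_size=64)
print(f"spread with 4 labels per submission:  {small:.3f}")
print(f"spread with 64 labels per submission: {large:.3f}")
```

Under these assumptions the spread shrinks roughly with the square root of the batch size, so larger batches give a cleaner signal but consume the label budget faster.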
[Figure 2 plot. x-axis: labels used (100 to 400); y-axis: normalized score (-0.2 to 1.2). Series: best so far, latest train batch. Annotation: final eval: 0.91.]
Fig. 2: An IRO optimization trajectory. The dim line is each train batch’s mean judge score and the solid line is the best score so far. The dashed line is the run’s final eval score.
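The two plotted series can be derived from a log of submissions. The sketch below assumes a run is recorded as a list of batch sizes and a list of per-batch mean judge scores; the numbers are invented for illustration and are not taken from the figure.

```python
from itertools import accumulate

# Hypothetical run log: labels spent and mean judge score per submission.
batch_sizes = [50, 50, 100, 100, 50, 50]
batch_means = [0.10, 0.35, 0.30, 0.55, 0.50, 0.70]

# x-axis: cumulative labels used after each submission.
labels_used = list(accumulate(batch_sizes))

# "best so far" line: running maximum of the per-batch train scores.
best_so_far = list(accumulate(batch_means, max))

print(labels_used)   # [50, 100, 200, 300, 350, 400]
print(best_so_far)   # [0.1, 0.35, 0.35, 0.55, 0.55, 0.7]
```

Because each batch mean is noisy, the running maximum can overstate true progress, which is one reason the final eval score is reported separately.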
We can decompose each run’s performance into the following attributes (see Fig. 3):
efficiency: how much score it gains from each marginal unit of resource, i.e., the slope of its trajectory.
propensity: how much of the resource it uses, i.e., how long it keeps going along its slope before stopping.
[Figure 3 diagram. x-axis: resource spent, up to the budget; y-axis: score. Labels: efficiency (slope per unit resource), propensity (where it stops), unused budget, foregone progress.]
Fig. 3: Decomposing a run's performance.
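One simple way to put numbers on this decomposition is sketched below. The operationalization is our assumption, not a definition from the post: efficiency is taken as the average score gained per label between the first and last recorded points, and propensity as the fraction of the budget spent before stopping. The function name `decompose` and the example run are hypothetical.

```python
def decompose(labels_used, scores, budget):
    """Return (efficiency, propensity) for one run.

    efficiency: average score gained per label spent (end-to-end slope).
    propensity: fraction of the budget used before the run stopped.
    Both definitions are illustrative assumptions.
    """
    if len(labels_used) < 2 or len(labels_used) != len(scores):
        raise ValueError("need at least two aligned (labels, score) points")
    efficiency = (scores[-1] - scores[0]) / (labels_used[-1] - labels_used[0])
    propensity = labels_used[-1] / budget
    return efficiency, propensity


# Hypothetical run: stops after 1,000 of 2,000 available labels.
efficiency, propensity = decompose(
    labels_used=[100, 400, 1000],
    scores=[0.20, 0.40, 0.65],
    budget=2000,
)
print(f"efficiency: {efficiency:.5f} score per label")
print(f"propensity: {propensity:.2f} of budget used")
```

A slope against log-scaled labels would be an equally reasonable choice when returns diminish; the linear slope is used here only to keep the sketch short.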
Methods
Our methods focus on varying the efficiency and propensity of a run.
All experiments in this post, unless stated otherwise, use Opus 4.6 as the optimizer, Haiku 4.5 as the poem generator, and Opus 4.6 as the judge. We use the same IRO tasks described in our previous post. Each condition runs every judge at several seeds; bars and points are means over runs and error bars are standard errors over runs. Plotted scores are on the normalized scale defined previously.
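For readers who want to reproduce this kind of summary, a minimal sketch of the mean and standard error over runs follows. It assumes the standard error is the sample standard deviation (n - 1 denominator) divided by the square root of the number of runs; the post does not state which estimator it uses, and the scores below are invented.

```python
import math
import statistics


def mean_and_sem(run_scores):
    """Mean and standard error of the mean over a list of per-run scores."""
    n = len(run_scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.fmean(run_scores)
    # statistics.stdev uses the n - 1 (sample) denominator.
    sem = statistics.stdev(run_scores) / math.sqrt(n)
    return mean, sem


# Hypothetical normalized eval scores from four runs of one condition.
mean, sem = mean_and_sem([0.80, 0.90, 0.85, 0.75])
print(f"{mean:.3f} +/- {sem:.3f}")
```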
We test two families of general elicitation interventions.
Prompt interventions
The prompt interventions we try are not task-specific; they are generic notes for optimization problems.
We had the following hypotheses on what prompt information might affect performance:
Information on the tractability of the task.
Suggestions on how to iterate.
We test the following prompts:
learnability: we tell the model that the task is learnable and that high scores are possible: “Calibration: this hidden-judge task is learnable from train feedback. There exists a learnable generation prompt whose mean performance is near perfect (for example, around 9 on the judge’s scoring scale). If your current prompt is scoring around 4, that is far from the best learnable behavior.”
strategy: we give the optimizer some iteration advice: “keep searching for qualitatively different hypotheses rather than settling for generic features; inspect high- and low-scoring examples, run controlled comparisons, validate promising discoveries...