Agents are under-elicited: A case study in optimization tasks


“Knowing is not enough; we must apply. Willing is not enough; we must do.”

— Johann Wolfgang von Goethe

At Fulcrum Research, we study the performance and behavior of long-horizon agents. In our previous post, we introduced inverse rubric optimization (IRO): tasks where an agent must learn the preferences of a black-box judge under a label budget. These are LLM optimization tasks, in which an agent iteratively optimizes a metric.

In this post, we study which general prompt and scaffold interventions improve performance in these LLM optimization settings. We show that our methods roughly double performance by changing both how much of their resources models use and how effectively they use each unit of resource.

This case study suggests agents are under-elicited by default, and simple methods can exploit this to yield substantial gains.

[Figure 1, top panel: "Elicitation ~doubles score at every budget". x-axis: label budget (100 to 10k); y-axis: eval score (0 to 1); series: elicited, baseline.]

[Figure 1, bottom panel: "Elicitation makes agents use more resources more efficiently". x-axis: labels used (log scale, 100 to 10k); y-axis: latest train score (0 to 1). Annotations: baseline, mean finish 2.6k, train 0.46; elicited, mean finish 9.8k, train 0.71; eval markers at 0.47 and 0.85; efficiency gap; propensity gap.]

Fig. 1: Elicitation roughly doubles eval score at every resource budget (top). At budget 10k (bottom), the elicited run, which uses handoff and prompting, climbs more steeply per label and runs far longer before stopping than the baseline. The resource here is labels from the oracle judge model whose preferences are being learned. Each curve stops at the mean finish point of that method's runs.

Decomposing performance in LLM optimization

In an LLM optimization trajectory, the agent makes progress iteratively by submitting attempts and reasoning about the results. The environment provides a feedback metric that the agent can call to check the quality of its work, for example by timing its code.

We call the feedback metric available during the trajectory the train metric, as opposed to the ground-truth eval score of the final submission. In IRO, the train metric is the mean judge-labeled score on a batch of samples from the train set, with the batch size chosen by the agent for each submission. The train metric is thus a noisy proxy for the eval score of the final submission.
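To make the "noisy proxy" point concrete, here is a minimal simulation sketch. It assumes each judge label equals the prompt's true mean score plus independent Gaussian noise. That noise model, the function names, and all numbers are illustrative assumptions, not details of the IRO setup.

```python
import random
import statistics

def train_metric(true_mean: float, batch_size: int, noise_sd: float,
                 rng: random.Random) -> float:
    """Mean of `batch_size` simulated judge labels (hypothetical noise model)."""
    labels = [rng.gauss(true_mean, noise_sd) for _ in range(batch_size)]
    return statistics.fmean(labels)

def spread_of_estimates(batch_size: int, trials: int = 2000, seed: int = 0) -> float:
    """Standard deviation of the train metric across repeated batches."""
    rng = random.Random(seed)
    # Assumed true mean 0.6 and per-label noise 0.3, for illustration only.
    estimates = [train_metric(0.6, batch_size, 0.3, rng) for _ in range(trials)]
    return statistics.stdev(estimates)

small = spread_of_estimates(batch_size=4)
large = spread_of_estimates(batch_size=64)
print(f"spread with 4 labels per batch:  {small:.3f}")
print(f"spread with 64 labels per batch: {large:.3f}")
```

Under this assumption the spread shrinks roughly with the square root of the batch size, so a larger batch gives a more reliable reading at the cost of more labels.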

LLM optimization naturally lets us study how well models use resources. Here, the resource is how much of the train metric the model is allowed to use, measured in judge labels.

[Figure 2 plot. x-axis: labels used (100 to 400); y-axis: normalized score (-0.2 to 1.2); series: best so far, latest train batch; annotation: final eval 0.91.]

Fig. 2: An IRO optimization trajectory. The dim line is each train batch’s mean judge score and the solid line is the best score so far. The dashed line is the run’s final eval score.
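The two curves described in the caption can be derived from a submission log as in the sketch below. The `(batch_size, mean_judge_score)` pairs are hypothetical values chosen for illustration, not data from any run in this post.

```python
from itertools import accumulate

# Hypothetical (batch_size, mean_judge_score) pairs, one per submission.
submissions = [(20, 0.10), (20, 0.35), (40, 0.30), (40, 0.62), (80, 0.58)]

# x-axis: cumulative labels spent after each submission.
labels_used = list(accumulate(size for size, _ in submissions))

# Dim line: the mean judge score of each train batch.
latest_train = [score for _, score in submissions]

# Solid line: running maximum of the train scores seen so far.
best_so_far = list(accumulate(latest_train, max))

for x, latest, best in zip(labels_used, latest_train, best_so_far):
    print(f"labels={x:4d}  latest={latest:.2f}  best={best:.2f}")
```

Because each batch mean is itself a noisy estimate, a running maximum of this kind can overstate the quality of the best attempt, which is one reason to report the final eval score separately.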

We can decompose each run's performance into two attributes (see Fig. 3):

efficiency: how well the run uses each marginal unit of resource, i.e., the slope of its trajectory.

propensity: how much of the resource the run uses, i.e., how long it keeps going along its slope before stopping.

[Figure 3 schematic. Score plotted against resource spent, with the budget marked. Labeled elements: efficiency (slope per unit resource), propensity (where it stops), unused budget, foregone progress.]

Fig. 3: Decomposing a run's performance.
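One way to turn this decomposition into numbers is sketched below. It assumes efficiency is measured as the least-squares slope of train score against labels used, and propensity as the fraction of the budget spent before stopping. Both definitions, the `Run` class, and the sample trajectory are illustrative assumptions, not necessarily the exact quantities used in our analysis.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class Run:
    labels_used: Sequence[float]  # cumulative labels at each submission
    scores: Sequence[float]       # train score at each submission
    budget: float                 # total labels allowed

def propensity(run: Run) -> float:
    """Fraction of the label budget spent before the run stopped."""
    return run.labels_used[-1] / run.budget

def efficiency(run: Run) -> float:
    """Least-squares slope of score against labels used (score per label)."""
    n = len(run.labels_used)
    mean_x = sum(run.labels_used) / n
    mean_y = sum(run.scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(run.labels_used, run.scores))
    var = sum((x - mean_x) ** 2 for x in run.labels_used)
    return cov / var

# Hypothetical trajectory, for illustration only.
run = Run(labels_used=[100, 200, 300, 400],
          scores=[0.2, 0.4, 0.6, 0.8],
          budget=1000)
print(f"efficiency = {efficiency(run):.4f} score per label")
print(f"propensity = {propensity(run):.2f} of budget")
```

Applied to the mean finish points reported in Fig. 1 at budget 10k, this definition of propensity would give about 0.26 for the baseline (2.6k labels) and about 0.98 for the elicited run (9.8k labels). The slope here is linear in labels, whereas Fig. 1 plots labels on a log scale, so the two slopes are not directly comparable.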

Methods

Our methods focus on varying the efficiency and propensity of a run.

All experiments in this post, unless stated otherwise, use Opus 4.6 as the optimizer, Haiku 4.5 as the poem generator, and Opus 4.6 as the judge. We use the same IRO tasks described in our previous post. Each condition runs every judge at several seeds; bars and points are means over runs, and error bars are standard errors over runs. Plotted scores are on the normalized scale defined previously.
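For readers who want to reproduce the aggregation, the sketch below computes a mean and standard error over runs. It assumes the usual estimator (sample standard deviation divided by the square root of the number of runs) and treats runs as independent. The scores are hypothetical, and `mean_and_sem` is an illustrative helper name.

```python
import math
import statistics

def mean_and_sem(scores):
    """Return (mean, standard error of the mean) for one condition's runs."""
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) over sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem

# Hypothetical normalized eval scores, one per (judge, seed) run.
runs = [0.80, 0.90, 0.70, 0.85, 0.75]
mean, sem = mean_and_sem(runs)
print(f"{mean:.3f} +/- {sem:.3f}")
```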

We test two families of general elicitation interventions.

Prompt interventions

The prompt interventions we try are not specific to the task, but generic notes for optimization problems.

We had the following hypotheses about which kinds of prompt information might affect performance:

Information on the tractability of the task.

Suggestions on how to iterate.

We test the following prompts:

learnability: we tell the model that the task is learnable and that high scores are possible: “Calibration: this hidden-judge task is learnable from train feedback. There exists a learnable generation prompt whose mean performance is near perfect (for example, around 9 on the judge’s scoring scale). If your current prompt is scoring around 4, that is far from the best learnable behavior.”

strategy: we give the optimizer some iteration advice: “keep searching for qualitatively different hypotheses rather than settling for generic features; inspect high- and low-scoring examples, run controlled comparisons, validate promising discoveries...”
