PROPEL: Breaking the Solver Bottleneck in Task-Generator RL

Authors: Lorenz Wolf, Connor Watts, Roger Creus Castanyer, Geoffrey Bradway, Maxwill Lin, Augustine N. Mavor-Parker, Matthew Daborn-Sargent

Description: PROPEL replaces solver trials with an activation probe on a frozen generator's internals, doubling frontier-task generation across math, code, and SWE.

Date: June 10, 2026
Affiliations: Vmax, Goodfire AI, Vmax, Vmax, Vmax, Vmax, Vmax

The Solver Bottleneck

Reinforcement learning on verifiable rewards has carried the current generation of reasoning and agentic models, but progress under this recipe is gated by task supply. As policies improve, fixed task distributions saturate; further gains require harder tasks that remain discriminative at the capability frontier. Hand-curated benchmarks cannot keep pace, and naive synthetic generation yields tasks that are trivially solvable or ill-posed.

The approach we are taking at Vmax is to train an open-ended task generator with RL, rewarding it for tasks that are valid and appropriately difficult for a target solver, the model or agent that attempts them. In this work only the generator is trained and the solver stays fixed; the goal is to make exactly this training step tractable. In previous work we co-evolved teacher and student through asymmetric self-play; a version of PROPEL that fully closes that loop is future work.

Training task generators is difficult because a reliable estimate of the solve rate requires many solver attempts. In agentic settings this is prohibitive. A single SWE rollout involves repository navigation, tool calls, and test execution, and can take tens of minutes. Embedding many such rollouts per candidate inside generator RL is intractable whenever the verifier is an expensive stochastic agent; in math and code the trials are cheaper, but the cost and variance remain.
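To make the "many solver attempts" point concrete, here is a rough sketch of how noisy a solve-rate estimate is for a given number of attempts. It assumes each attempt is an independent Bernoulli trial with a fixed success probability `p`, which is a simplification not stated in the text; the function name is illustrative.

```python
import math


def solve_rate_standard_error(p: float, k: int) -> float:
    """Standard error of the mean solve rate over k attempts.

    Assumption (not from the article): attempts are i.i.d. Bernoulli(p),
    so the mean of k attempts has variance p * (1 - p) / k.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k <= 0:
        raise ValueError("k must be positive")
    return math.sqrt(p * (1.0 - p) / k)


if __name__ == "__main__":
    # Compare a few attempt budgets for a task with true solve rate 0.25.
    for k in (8, 32, 128):
        print(k, round(solve_rate_standard_error(0.25, k), 3))
```

Under this assumption, a task with true solve rate 0.25 measured over 8 attempts has a standard error of about 0.15, which is larger than the 1/8 spacing between achievable solve rates. The error shrinks only with the square root of the attempt count, so halving it requires four times as many rollouts.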

PROPEL (Probe Rewards for Optimizing Problems at the Edge of Learning) breaks this bottleneck. A small activation probe, trained once on solver-labeled tasks, predicts the target solver's pass rate from a single forward pass through a frozen reference model. This probe replaces the solver in the RL loop (Figure 1). Across math, code induction, and software engineering (including a 27B agent on unseen repositories), the probe-trained generator produces tasks at the learnable frontier at roughly double the rate of the base generator.

Figure 1. Solver-in-the-loop reward vs PROPEL reward. The solver-in-the-loop baseline pays K solver trials per candidate task on every RL step; PROPEL scores each candidate with a single forward pass through a frozen reference model and a probe, with zero solver trials in the loop.

What Makes a Good Task Generator?

We evaluate generators on three axes:

Validity: the fraction of generations that are well-formed (programs that parse and run).
Utility: the fraction at the learnable frontier, i.e. with solve rate inside a target band.
Diversity: coverage of the task space rather than repetition of a few patterns.

Utility uses a strict band. A task is useful if its mean solve rate over K solver attempts lands inside a target band:

$$
U_S(x) \;=\; \mathbb{I}\Big[\, a \,\le\, \underbrace{\mu_S(x)}_{\substack{\text{mean solve rate of} \\ \text{solver } S \text{ on task } x}} \,\le\, b \,\Big]
$$

For math and AZR, the band is 1–3 successes out of 8 attempts (a = 1/8, b = 3/8); for SWE, the optimal solve rate band runs from a = 1/3 to b = 2/3. Tasks the solver always fails are too hard, and tasks it always solves are saturated; only the middle band earns credit. The probe is trained against this binary label, and every reported result re-scores fresh generations with the actual solver; the probe score is never the reported outcome.
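The band check can be sketched in a few lines. This is a minimal illustration of the indicator defined above, not the authors' code: the function names and the batch-level `utility` helper are hypothetical, and the number of SWE attempts per task is not given in the text, so the band is expressed purely as a fraction.

```python
from fractions import Fraction


def mean_solve_rate(outcomes):
    """Fraction of solver attempts that succeeded (outcomes are booleans)."""
    if not outcomes:
        raise ValueError("need at least one solver attempt")
    return Fraction(sum(bool(o) for o in outcomes), len(outcomes))


def is_useful(outcomes, a, b):
    """Binary utility label: 1 if a <= mean solve rate <= b, else 0."""
    return int(a <= mean_solve_rate(outcomes) <= b)


# Bands quoted in the text.
MATH_BAND = (Fraction(1, 8), Fraction(3, 8))
SWE_BAND = (Fraction(1, 3), Fraction(2, 3))


def utility(batch, band):
    """Fraction of tasks in a batch whose solve rate lies inside the band."""
    a, b = band
    return sum(is_useful(outcomes, a, b) for outcomes in batch) / len(batch)


if __name__ == "__main__":
    two_of_eight = [True, True] + [False] * 6
    print(is_useful(two_of_eight, *MATH_BAND))   # inside the 1/8..3/8 band
    print(is_useful([False] * 8, *MATH_BAND))    # never solved: too hard
    print(is_useful([True] * 8, *MATH_BAND))     # always solved: saturated
```

Using exact `Fraction` values keeps the inclusive boundary comparisons (for example exactly 1/3 or 2/3) free of floating-point rounding.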

PROPEL

PROPEL replaces in-the-loop solver rollouts with a single forward pass through a frozen reference model and a small activation probe. We collect a one-time labeled corpus of (task, solver-outcome) pairs and train the probe on the reference model's hidden states to predict the target solver's pass rate; gated by a cheap validity check, the probe then serves as the RL reward for the task generator (Figure 2).

Figure 2. The two phases of PROPEL. Phase 1, run once: generated tasks are labeled with solver outcomes and the probe is trained on frozen-reference activations. Phase 2: generator RL against the probe, with no solver in the loop.
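The two phases can be illustrated with a small, self-contained sketch. Everything specific here is an assumption: the text does not state the probe architecture, which layer or pooling supplies the activations, or exactly how the validity gate combines with the probe score. The sketch uses a logistic-regression probe over fixed feature vectors (stand-ins for frozen-reference hidden states), trains it on the binary in-band label, and zeroes the reward for invalid tasks. The data is synthetic.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(feats, labels, lr=0.5, steps=200, l2=1e-3):
    """Phase 1 (run once): fit a logistic-regression probe by gradient descent.

    feats:  list of feature vectors, hypothetical stand-ins for pooled
            hidden states of the frozen reference model.
    labels: list of 0/1 in-band labels obtained from solver rollouts.
    """
    n, d = len(feats), len(feats[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * x[j] / n
        w = [wi - lr * (g + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def probe_reward(feat, is_valid, w, b):
    """Phase 2: reward for one candidate task, with no solver call.

    Assumed gating rule: invalid tasks get zero reward; valid tasks get
    the probe's predicted probability of landing in the target band.
    """
    if not is_valid:
        return 0.0
    return sigmoid(sum(wi * xi for wi, xi in zip(w, feat)) + b)


if __name__ == "__main__":
    rng = random.Random(0)
    d, n = 8, 200
    true_w = [rng.gauss(0, 1) for _ in range(d)]
    X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    # Synthetic labels for illustration only, not real solver outcomes.
    y = [1.0 if sum(a * c for a, c in zip(true_w, x)) > 0 else 0.0 for x in X]
    w, b = train_linear_probe(X, y)
    print(probe_reward(X[0], True, w, b), probe_reward(X[0], False, w, b))
```

In an actual setup the feature vectors would come from a forward pass of the frozen reference model over the generated task, and `probe_reward` would be called once per candidate inside the generator's RL loop.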

This exploits a well-documented property of language models: quantities of interest are often represented internally even when the model cannot act on them reliably at generation time. If a task's well-formedness, solvability, and difficulty are decodable from hidden states, the probe provides a dense, near-free reward long before any solver rollout would confirm it.

The probe reads from a frozen reference...
