Humans Still Beat AI in the Long Horizon: Revisiting Test-Time Scaling in the Agent Era | Qiuyang Mang
TL;DR. Agents can spend test-time compute by trying, observing, and revising, so we ask whether their gains come from a better internal strategy or from something close to repeated sampling. We derive a simple Elo reference line: under repeated sampling, Elo is linear in log test-time compute. In a 2022 two-week coding marathon, current agents plateau within 24 hours, while top humans keep improving over the official two weeks. The takeaway is that humans are still much better at long-horizon test-time adaptation, and agent strategies have a lot of room to improve.<br>Agents Bring Intrinsic Test-Time Strategies<br>OpenAI’s o1 report showed that more test-time compute can improve model performance. Many papers followed, especially on verifiable tasks like code and math (Snell et al., Large Language Monkeys, Noam Brown’s recent post). The common plot is success rate versus the log of the number of trials, or log test-time compute. These curves often rise superlinearly before they saturate.
Coverage on MATH with an oracle verifier as the number of samples increases, from Large Language Monkeys.<br>These studies measure model performance under an external test-time strategy. The strategy is fixed outside the model: sample many candidate solutions, check them with a verifier, and report pass@k or coverage.<br>However, agents change this setup. During a run, an agent can try a solution, observe the result, reflect on what failed, and revise its next attempt. This raises the question we study: when an agent improves with more test-time compute, is it using a better test-time strategy, or is it mostly reproducing repeated sampling?
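The external strategy described above (sample candidates, check each with a verifier, report coverage) can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the evaluation code of any cited paper: `coverage_curve` and the per-task success probabilities in `success_probs` are hypothetical, and the oracle verifier is modeled as a coin flip that accepts a sample with the task's probability.

```python
import random


def coverage_curve(success_probs, max_k, seed=0):
    """Empirical coverage at k = 1..max_k under external repeated sampling.

    success_probs[t] is a hypothetical per-sample success probability for
    task t. Every sample is drawn independently; nothing is learned from
    earlier failures, which is what makes the strategy "external".
    """
    rng = random.Random(seed)
    first_success = []  # index of the first accepted sample per task, or None
    for p in success_probs:
        first = None
        for k in range(1, max_k + 1):
            # One independent sample; the oracle verifier accepts it with prob. p.
            if rng.random() < p:
                first = k
                break  # later samples cannot change whether the task is covered
        first_success.append(first)

    n_tasks = len(success_probs)
    # Coverage at k: fraction of tasks with at least one accepted sample among the first k.
    return [
        sum(1 for f in first_success if f is not None and f <= k) / n_tasks
        for k in range(1, max_k + 1)
    ]


if __name__ == "__main__":
    # Hypothetical task mix: a few easy tasks, many hard ones.
    probs = [0.5] * 10 + [0.05] * 40 + [0.005] * 50
    curve = coverage_curve(probs, max_k=256)
    for k in (1, 4, 16, 64, 256):
        print(f"k={k:>3}  coverage={curve[k - 1]:.2f}")
```

The loop has no feedback path from a failed sample to the next one. An agent run differs exactly there: its next attempt can depend on what it observed.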
Repeated sampling is fixed outside the model, while an agent can use feedback inside the run.<br>A Simple Model for Test-Time Scaling<br>We first write down the simplest model behind the usual pass@k curves: repeated sampling. The model treats each attempt as an independent draw from the same continuous score distribution. For one task, let \(X\) be the score of one sample and let \(\tau\) be the threshold for success. Then one sample succeeds with probability<br>\[p = \operatorname{Pr}(X \geq \tau).\] With \(k\) independent samples, pass@k is<br>\[\operatorname{Pr}\left(\max_{1 \leq i \leq k} X_i \geq \tau\right) = 1 - (1 - p)^k.\] For a dataset with multiple tasks, the usual test-time scaling curve averages this quantity across tasks. This gives a curve of mean pass@k, or coverage, as a function of the number of samples.<br>However, this evaluation is awkward for agents. An agent usually stops once it solves the task, so it is not natural to keep asking for more independent samples after success. For open-ended tasks such as FrontierCS, we could instead compare runs by the task’s own score. But raw-score gains are hard to interpret. In circle-packing tasks studied by AlphaEvolve, improving the objective value from 1 to 2 can be trivial, while improving it from 2.35 to 2.36 can be far harder. The raw score does not by itself tell us how much capability changed. We therefore want a comparison that asks a simpler question: when one candidate spends more test-time compute than another candidate on the same task, how often does it produce the better answer?<br>Following the same repeated-sampling model, this becomes a pairwise question. 
If one candidate gets \(k_a\) independent attempts and another gets \(k_b\) independent attempts, the pairwise win probability is<br>\[\operatorname{Pr}\left(\max_{1 \leq i \leq k_a} X_i > \max_{1 \leq j \leq k_b} X'_j\right) = \frac{k_a}{k_a + k_b},\] where \(X_i\) and \(X'_j\) are independent attempts on the same task. The identity holds because the scores are continuous, so ties have probability zero, and each of the \(k_a + k_b\) attempts is equally likely to be the best one.<br>Now comes the useful part. A Bradley-Terry model converts pairwise win rates into a one-dimensional strength scale. If candidate \(a\) has BT log-strength \(\theta_a\) and candidate \(b\) has BT log-strength \(\theta_b\), then<br>\[\operatorname{Pr}(a \text{ beats } b) = \frac{\exp(\theta_a)}{\exp(\theta_a) + \exp(\theta_b)}.\] For repeated sampling, the pairwise win probability above is exactly matched by<br>\[\theta_a = \log k_a + c, \qquad \theta_b = \log k_b + c,\] for any shared constant \(c\).<br>Thus, repeated sampling is a test-time strategy whose Elo is linear in log test-time compute, since Elo is just a rescaling of the BT log-strength and test-time compute is proportional to the number of attempts. This is helpful because it gives us a reference line. To judge an agent’s intrinsic test-time strategy, we can plot its Elo curve as test-time compute increases and compare it to this line. If the curve grows faster than, slower than, or at the same rate as this line, the agent’s strategy is better than, worse than, or equivalent to repeated sampling.
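The pairwise win probability and its Bradley-Terry match above can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, scores are drawn from a uniform distribution (any continuous distribution gives the same win rate, since only the ranking of attempts matters), and the conversion to Elo assumes the common 400-point base-10 logistic convention, which the text does not specify.

```python
import math
import random


def simulated_win_rate(k_a, k_b, trials=20000, seed=0):
    """Monte Carlo estimate of Pr(best of k_a draws > best of k_b draws)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        best_a = max(rng.random() for _ in range(k_a))
        best_b = max(rng.random() for _ in range(k_b))
        wins += best_a > best_b
    return wins / trials


def bt_win_probability(theta_a, theta_b):
    """Bradley-Terry win probability from log-strengths."""
    return 1.0 / (1.0 + math.exp(theta_b - theta_a))


# Assumed Elo convention: Pr(a beats b) = 1 / (1 + 10 ** ((R_b - R_a) / 400)).
ELO_PER_NAT = 400.0 / math.log(10.0)


def reference_elo(k, offset=0.0):
    """Elo of repeated sampling with k attempts: 400 * log10(k) + offset."""
    return ELO_PER_NAT * math.log(k) + offset


def elo_win_probability(r_a, r_b):
    """Win probability under the assumed 400-point Elo convention."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


if __name__ == "__main__":
    for k_a, k_b in [(1, 1), (2, 1), (4, 1), (8, 2)]:
        closed_form = k_a / (k_a + k_b)
        bt = bt_win_probability(math.log(k_a), math.log(k_b))
        mc = simulated_win_rate(k_a, k_b)
        print(f"k_a={k_a} k_b={k_b}  closed={closed_form:.3f}  BT={bt:.3f}  MC={mc:.3f}")

    # Elo gained by each doubling of attempts on the reference line.
    print(f"Elo per doubling: {reference_elo(2) - reference_elo(1):.1f}")
```

Under the assumed Elo scale, each doubling of attempts adds the same fixed amount of Elo, \(400 \log_{10} 2\), so the reference line is straight on a log-compute axis whatever the offset \(c\) is.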
Repeated sampling gives a linear reference line in Elo versus log test-time compute.<br>Agents Struggle, Humans Do More Than Sample<br>With this reference line in hand, we can ask what happens in real long-horizon tasks. We compare agent trajectories against the repeated-sampling line, and we also ask: how do they compare to top humans working on the same tasks?<br>We study AtCoder Heuristic...