Is Capability a Liability? More Capable Language Models Make Worse Forecasts



arXiv:2605.22672 (cs)


[Submitted on 21 May 2026]

Title: Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

Authors: Nick Merrill, Jaeho Lee, Ezra Karger

Abstract: We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, and in forecasts of synthetic SIR epidemics with a matched linear control, and it replicates on real-world datasets covering COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows that the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that model scale and post-training each contribute independently to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on the single-threshold metrics common in LLM forecasting benchmarks: single-threshold scoring at conventional cutoffs misses the upper-tail cost, while tail-inclusive scoring reverses the sign of the capability-accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.
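The contrast the abstract draws between bounded single-threshold scoring and continuous, unbounded scoring can be sketched with a toy example. This is not the paper's scoring code: the pinball (quantile) loss and the Brier score are standard stand-ins chosen here for illustration, the function names are hypothetical, and the two forecasts and the observed value are made-up numbers. The two forecasts share the same lower tail and median and differ only in the 0.9 quantile.

```python
# Illustrative sketch only: toy numbers, not data or code from the paper.

def pinball_loss(level, predicted, observed):
    """Pinball (quantile) loss for one quantile forecast; unbounded above."""
    diff = observed - predicted
    return max(level * diff, (level - 1.0) * diff)


def prob_exceeds(quantiles, threshold):
    """Step estimate of P(Y > threshold) from a {level: value} quantile forecast."""
    levels_at_or_below = [lvl for lvl, val in quantiles.items() if val <= threshold]
    return 1.0 - max(levels_at_or_below, default=0.0)


def threshold_brier(quantiles, observed, threshold):
    """Brier score for the binary event Y > threshold; bounded in [0, 1]."""
    outcome = 1.0 if observed > threshold else 0.0
    return (prob_exceeds(quantiles, threshold) - outcome) ** 2


# Hypothetical outcome and cutoff.
observed = 100.0
threshold = 50.0

# Same lower tail and median; only the upper tail differs.
cautious = {0.1: 60.0, 0.5: 90.0, 0.9: 150.0}
aggressive = {0.1: 60.0, 0.5: 90.0, 0.9: 1500.0}

for name, forecast in [("cautious", cautious), ("aggressive", aggressive)]:
    per_quantile = {
        lvl: pinball_loss(lvl, val, observed) for lvl, val in forecast.items()
    }
    print(name, "threshold Brier:", threshold_brier(forecast, observed, threshold))
    print(name, "pinball by level:", per_quantile)
```

In this toy setup both forecasts receive the same threshold score, because the cutoff sits below every quantile of both, while the per-quantile pinball loss at the 0.9 level is far larger for the forecast with the inflated upper tail. A bounded binary metric at a single cutoff cannot register that difference.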

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2605.22672 [cs.AI] (or arXiv:2605.22672v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2605.22672


arXiv-issued DOI via DataCite (pending registration)

Submission history
From: Nick Merrill
[v1] Thu, 21 May 2026 16:14:33 UTC (298 KB)


