SharpeBench: A luck-robust benchmark for AI trading agents

General Liquidity Research · June 22, 2026 · Tiberiu Toca · 12 min read

Give a thousand monkeys a quarter of market data and one of them will look like Renaissance. Rank traders by return over a short window and you are mostly ranking the variance of noise, not skill. Today we are open-sourcing SharpeBench, a luck-robust benchmark for AI trading agents. It is a single deterministic binary that takes any agent, in any language, and scores it not on how much it made, but on whether its edge is real. The crates are on crates.io and the methodology is below.

We built it because the field is being measured on sand. A 2026 audit of nineteen LLM-trading studies found that zero reached full reproducibility, only two used time-consistent train and test splits, and exactly one modelled transaction costs. When the scoreboard rewards the luckiest run on the friendliest window, that is what the research optimises for. For a coding agent, a flattering benchmark wastes some time. For a capital agent, it actively selects the strategy most likely to blow up, the one whose backtest looked best because it overfit hardest.

Leaderboards Rank Luck

Finance has known for a long time that performance and skill are not the same thing. When Barras, Scaillet and Wermers applied a false-discovery correction to two decades of mutual funds, they found that roughly three quarters had no genuine alpha at all, and that most of the apparent winners were false positives. Fama and French reached the same conclusion with a different method: once you account for the sheer number of funds, very few beat their benchmark by more than luck would predict. The lesson is old and well tested. A track record is evidence, but a short one, drawn from a large pool of candidates, is weak evidence.

The current generation of AI-trading benchmarks inherits none of this caution. FinBen, one of the most cited, ranks GPT-4 first on its trading task, yet its own reported Sharpe ratio is 1.51 with a confidence interval of plus or minus 1.08, a band running from about 0.43 to 2.59. An interval that wide cannot separate the agent it places first from the middle of the field. StockBench evaluates on a single four-month window and ranks on cumulative return and drawdown, with no repeated-run reliability and no deflation. QuantBench names overfitting as an open problem and then ranks pipeline performance with no luck-corrected metric to address it.

None of this is surprising once you look at how easy the numbers are to manufacture. Bailey, Borwein, Lopez de Prado and Zhu showed that a high backtest Sharpe is trivially achievable by trying enough configurations: with only a handful of years of data, a few dozen variations are enough to produce a spurious two-plus Sharpe that vanishes out of sample. Harvey and Liu turned this into a practical correction, a Sharpe-ratio haircut that scales with how many strategies were tried. A benchmark that does not apply that haircut is not measuring skill. It is running an overfitting contest and handing the trophy to the best overfitter.
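The selection effect described above can be reproduced with nothing but noise. The following is a minimal sketch, not part of SharpeBench: it simulates strategies whose true edge is zero and tracks the best annualized Sharpe seen so far as more of them are tried. The function names, the 252-day year, the 1% daily volatility, the two-year window, and the seed are all illustrative assumptions.

```python
import math
import random
import statistics

TRADING_DAYS = 252  # assumed number of return observations per year


def annualized_sharpe(daily_returns):
    """Sample mean over sample standard deviation, scaled to a yearly figure."""
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)
    return mean / stdev * math.sqrt(TRADING_DAYS)


def best_sharpe_by_trial_count(n_trials, n_days, seed=0):
    """Running maximum Sharpe over n_trials strategies with zero true edge."""
    rng = random.Random(seed)
    running_best = []
    best = -math.inf
    for _ in range(n_trials):
        # Pure noise: zero mean, 1% daily volatility (illustrative values).
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        best = max(best, annualized_sharpe(returns))
        running_best.append(best)
    return running_best


# Hypothetical experiment: 50 skill-free strategies on two years of daily data.
running = best_sharpe_by_trial_count(n_trials=50, n_days=2 * TRADING_DAYS, seed=7)
best_so_far = {n: running[n - 1] for n in (1, 10, 50)}
for n, sharpe in best_so_far.items():
    print(f"best of {n:>2} noise strategies: annualized Sharpe {sharpe:.2f}")
```

Because every strategy here has zero expected return, any positive Sharpe the loop reports is selection alone. The running maximum can only rise as trials are added, which is the reason a haircut has to scale with the number of strategies tried.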

Skill That Survives Deflation

The Sharpe ratio quoted everywhere is just the sample mean of returns over their standard deviation. It says nothing about how many strategies you tried before you found that one, how long the track is, or how fat-tailed the returns are. SharpeBench adds, as ranking gates, the corrections that fix exactly this.

The first gate is the Deflated Sharpe Ratio of Bailey and Lopez de Prado. Its building block, the Probabilistic Sharpe Ratio, asks a sharper question than “how big is the Sharpe?”: given the track length and the return distribution’s skew and kurtosis, what is the probability that the true Sharpe exceeds a benchmark? The deflated version sets that benchmark to the Sharpe you would expect from the best of N random trials, so every additional strategy you try raises the bar the survivor must clear.

The effect is easiest to see on a single track. Hold one agent’s returns fixed and ask what its deflated score is as the number of strategies that were tried before it grows. The score is near certainty when the track stands alone, falls through the rank-eligibility bar once it is the best of a few dozen, and collapses toward zero once it is the best of hundreds. The agent did not get worse. The context did.

Figure: The same return track, scored against an ever-larger pool of trials. It is convincing as a one-off and indistinguishable from luck once it is the survivor of hundreds of attempts. Computed directly with the SharpeBench kernel formulas (skew and kurtosis set to normal for clarity).

A single number, however well corrected, can still hide a strategy that works on most runs and detonates on the rest. So the second gate is reliability. SharpeBench borrows pass^k from Sierra’s agent-reliability work, and the distinction matters: pass@k asks whether an agent succeeds at least once in k tries, which flatters; pass^k asks whether it succeeds on every one of...
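Returning to the first gate, the deflation described above can be sketched in a few lines. This is a hedged illustration based on the published Probabilistic and Deflated Sharpe Ratio formulas of Bailey and Lopez de Prado, not SharpeBench's own code: the function names are hypothetical, the Sharpe is per period rather than annualized, and the spread of Sharpe ratios across trials (`sr_std`) is assumed to equal the sampling noise of a skill-free track. Skew and kurtosis default to their normal values of 0 and 3.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def probabilistic_sharpe(sr_hat, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true per-period Sharpe exceeds sr_benchmark."""
    denom = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2)
    z = (sr_hat - sr_benchmark) * math.sqrt(n_obs - 1) / denom
    return STD_NORMAL.cdf(z)


def expected_max_sharpe(n_trials, sr_std):
    """Approximate expected best Sharpe among n_trials skill-free strategies."""
    if n_trials <= 1:
        return 0.0  # a lone track is only asked to beat zero
    a = STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sr_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sr_hat, n_obs, n_trials, sr_std, skew=0.0, kurt=3.0):
    """Probabilistic Sharpe measured against the best-of-N benchmark."""
    benchmark = expected_max_sharpe(n_trials, sr_std)
    return probabilistic_sharpe(sr_hat, benchmark, n_obs, skew, kurt)


# Hypothetical track: 504 daily returns with a per-day Sharpe of 0.10.
N_OBS = 504
SR_HAT = 0.10
# Assumption: trial Sharpes are scattered like pure sampling noise.
SR_STD = 1.0 / math.sqrt(N_OBS - 1)

scores = {n: deflated_sharpe(SR_HAT, N_OBS, n, SR_STD) for n in (1, 10, 100, 1000)}
for n, score in scores.items():
    print(f"best of {n:>4} trials: deflated score {score:.3f}")
```

The returns never change between rows; only the benchmark moves. The source does not state the rank-eligibility threshold, so none is encoded here.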
