New SOTA: TrustedRouter Fusion beats Fable and Frontier | TrustedRouter
2026-06-17 · TrustedRouter-Fusion-Draco on GitHub
Figure: DRACO: TrustedRouter Fusion beats Fable and Frontier. Score out of 100, same judge (gemini-3.1-pro, reasoning high). Higher is better. Legend: TrustedRouter, OpenRouter.

| Configuration | Score |
| --- | --- |
| Frontier panel -> Opus fuser | 70.6 |
| Fable 5 + GPT-5.5 | 69.0 |
| Opus + GPT-5.5 + Gemini | 68.3 |
| Opus + GPT-5.5 | 67.6 |
| Opus + Opus | 65.5 |
| Fable 5 (solo) | 65.3 |
| Budget panel -> Opus fuser | 64.7 |
| GPT-5.5 (solo) | 63.0 |
| Budget panel -> Opus fuser | 62.6 |
| Frontier panel -> GPT-5.5 fuser | 62.2 |
Frontier panel = gpt-5.5 + opus-4.8 + gemini-3-flash + kimi-k2.6 + deepseek-v4-pro (closed + open weights). 100 DRACO tasks, single judge pass.
Research is only worth as much as someone else's ability to run it again. Too much of AI has drifted the other way: the strongest results arrive as a single number in a post, produced by a model you cannot open, on a harness no one else can see, graded by a rubric that ships to nobody. You are asked to take it on faith. We are building TrustedRouter to be an AI lab that does open science the old way: open code, open results, nothing hidden. Our whole stack is radically open source — frontend and backend alike, Apache-2.0 licensed — and so is everything behind this benchmark. That is how a benchmark number earns trust: verifiability, not hype.
So we held ourselves to it. We set out to reproduce OpenRouter's Fusion result (that a panel of models, each writing its own answer, with a final model synthesizing them, beats any single model on a hard research benchmark) and then to push past it. On DRACO, a hundred deep-research tasks each graded by gemini-3.1-pro against roughly forty weighted criteria, a diverse panel synthesized by Claude Opus 4.8 scores 70.6. That is the state of the art, above OpenRouter's best published fusion of Fable 5 and GPT-5.5 at 69.0. Every prompt, every tool call, and every graded answer behind the number is published.
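The post does not spell out how the judge's per-criterion verdicts roll up into a score out of 100. One plausible reading, offered purely as an assumption, is a weight-normalized share of criteria met, averaged over tasks. The `Criterion` type, the example weights, and both helper functions below are hypothetical and are not taken from the DRACO harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str
    weight: float  # relative importance (assumed positive)
    met: bool      # judge verdict for this criterion


def task_score(criteria: list[Criterion]) -> float:
    """Weighted share of criteria met, scaled to 0-100 (assumed formula)."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("criteria must carry positive total weight")
    earned = sum(c.weight for c in criteria if c.met)
    return 100.0 * earned / total


def benchmark_score(per_task: list[float]) -> float:
    """Unweighted mean over tasks (also an assumption)."""
    return sum(per_task) / len(per_task)


# Made-up criteria for illustration only.
example = [
    Criterion("cites a primary source", 3.0, True),
    Criterion("reports the correct figure", 5.0, True),
    Criterion("states the uncertainty", 2.0, False),
]
print(task_score(example))  # 80.0
```

Under this reading, a headline number such as 70.6 would be the mean of a hundred per-task scores; the published harness is the place to confirm the actual aggregation.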
The result comes from the panel, and the panel is itself an argument for open weights. OpenRouter's strongest fusions paired two closed frontier models. Ours adds frontier open-weights models — DeepSeek V4 Pro and Kimi K2.6 — alongside GPT-5.5, Opus, and Gemini 3 Flash. Fusion works on disagreement: models that fail in different places, reconciled by a strong synthesizer. Open-weights models are trained on different data and disagree in different ways than a closed pair does, and the wider panel is what reaches the top.
The synthesizer carries most of that result. Hold the five-model panel fixed and change only the model that writes the final answer: Opus 4.8 scores 70.6, GPT-5.5 scores 62.2. Same reports, same judge analysis, same hundred tasks, and 8.4 points of swing from one decision. A larger panel behind a weaker synthesizer buys nothing, and the choice of synthesizer deserves a ranking of its own.
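The panel-then-synthesizer shape described above can be sketched as follows. This is a minimal illustration, not the TrustedRouter harness: the `Model` callable type, the `fuse` function, and the prompt wording are hypothetical, and stub functions stand in for real model calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# A model is anything that maps a prompt to a text answer (hypothetical interface).
Model = Callable[[str], str]


def fuse(task: str, panel: dict[str, Model], synthesizer: Model) -> str:
    """Collect one independent report per panel model, then ask the synthesizer to reconcile them."""
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        futures = {name: pool.submit(model, task) for name, model in panel.items()}
        reports = {name: future.result() for name, future in futures.items()}
    sections = "\n\n".join(
        f"## Report from {name}\n{text}" for name, text in reports.items()
    )
    prompt = (
        f"Task:\n{task}\n\n{sections}\n\n"
        "Write one final answer. Where the reports disagree, say which claim is better supported."
    )
    return synthesizer(prompt)


# Stubs so the sketch runs offline.
panel = {
    "model-a": lambda task: "Answer A",
    "model-b": lambda task: "Answer B",
}


def echo_synthesizer(prompt: str) -> str:
    return f"fused {prompt.count('## Report from')} reports"


print(fuse("Example task", panel, echo_synthesizer))  # fused 2 reports
```

Holding `panel` fixed and swapping only the `synthesizer` argument mirrors the comparison in the text, where the same five reports produce 70.6 with one synthesizer and 62.2 with another.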
No single model comes near 70.6 on its own. Run each one through the same agentic loop with the same live tools, and the strongest of them lands 7.6 points below the panel.
| Solo model | TrustedRouter | OpenRouter |
| --- | --- | --- |
| GPT-5.5 | 63.0 | 60.0 |
| Claude Opus 4.8 | 60.7 | 58.8 |
| DeepSeek V4 Pro | 59.9 | 60.3 |
| Kimi K2.6 | 50.1 | 53.7 |
| Gemini 3.1 Pro | 47.4 | 45.4 |
| Gemini 3 Flash | 41.1 | 43.1 |
The strongest solo reaches 63; the panel reaches 70.6. Assembling a frontier answer out of models that are each behind the frontier is the entire point.
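The gaps quoted in the text follow directly from the published numbers; the snippet below only redoes that arithmetic. The variable names are illustrative.

```python
# Solo scores from the table above: (TrustedRouter, OpenRouter).
solo = {
    "GPT-5.5": (63.0, 60.0),
    "Claude Opus 4.8": (60.7, 58.8),
    "DeepSeek V4 Pro": (59.9, 60.3),
    "Kimi K2.6": (50.1, 53.7),
    "Gemini 3.1 Pro": (47.4, 45.4),
    "Gemini 3 Flash": (41.1, 43.1),
}
panel_opus_fuser = 70.6  # five-model panel, Opus 4.8 synthesizer
panel_gpt_fuser = 62.2   # same panel, GPT-5.5 synthesizer

# Strongest solo model by its TrustedRouter score.
best_name, (best_score, _) = max(solo.items(), key=lambda kv: kv[1][0])

print(best_name, round(panel_opus_fuser - best_score, 1))  # GPT-5.5 7.6
print(round(panel_opus_fuser - panel_gpt_fuser, 1))        # 8.4
```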
DRACO is an agentic benchmark. The answers are not in any model's weights, so each model in the panel has to search the web, read the sources, and run the numbers itself; we give every one of them live tools and let it drive its own research. Those runs issued thousands of searches and fetches, and all of them sit in the published replays — none touching the benchmark's own hosts, so nothing was looked up that was meant to be worked out. The leakage guard lives in the open-source harness, and the audit is yours to re-run.
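The kind of audit described here can be approximated in a few lines. This sketch assumes each replay reduces to a list of fetched URLs and uses a placeholder blocklist; the host names, the helper functions, and the data shape are all hypothetical and are not the harness's actual leakage guard.

```python
from urllib.parse import urlsplit

# Placeholder hosts; the real blocklist lives in the harness and is not reproduced here.
BLOCKED_HOSTS = {"benchmark.example", "datasets.example"}


def is_blocked(url: str, blocked: set[str] = BLOCKED_HOSTS) -> bool:
    """True if the URL's host is a blocked host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == b or host.endswith("." + b) for b in blocked)


def leaked_urls(fetched: list[str]) -> list[str]:
    """Return every fetched URL that touches a blocked host."""
    return [u for u in fetched if is_blocked(u)]


# Made-up replay for illustration.
replay = [
    "https://en.wikipedia.org/wiki/Benchmark",
    "https://mirror.benchmark.example/tasks/17",
    "https://notbenchmark.example/page",
]
print(leaked_urls(replay))  # ['https://mirror.benchmark.example/tasks/17']
```

Matching on the parsed host name with a leading-dot suffix check, instead of a substring search over the whole URL, avoids flagging lookalike domains such as `notbenchmark.example`. An empty result across all replays is what a clean audit would look like under these assumptions.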
We ran all of it on TrustedRouter for the same reason we published the code. A benchmark sends your prompts and the documents you fetch through someone else's servers, and with most gateways you take their privacy on faith. TrustedRouter runs inside a Trusted Execution Environment (TEE), end-to-end encrypted: a sealed enclave the operator cannot read into, handling every request as an attested workload whose exact code is measured and published. You can pull the image digest, match it against the open source, and confirm the binary that saw your prompt is the one in the repository, with nowhere inside it to record anything. You check the privacy the way you check the score — by hand, against a hash.
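The last step of that check, comparing a published digest with one you compute yourself, might look like the sketch below. It assumes the digest is a SHA-256 over a byte stream written as `sha256:<hex>`; how TrustedRouter actually measures and publishes its image is not specified in this post, and a full attestation check also involves verifying the enclave's signed quote, which this sketch omits. `digest_of` and `matches` are hypothetical helpers.

```python
import hashlib
import hmac


def digest_of(data: bytes) -> str:
    """SHA-256 of the bytes, formatted like a container image digest."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def matches(published: str, data: bytes) -> bool:
    """Constant-time comparison of a published digest with a locally computed one."""
    return hmac.compare_digest(published.strip().lower(), digest_of(data))


artifact = b"example build artifact"
published = digest_of(artifact)  # stand-in for the digest a vendor would publish

print(matches(published, artifact))         # True
print(matches(published, artifact + b"!"))  # False
```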
We do not want you to trust our 70.6. Clone the repository — the harness, the tasks, the judge, the panel, and the raw run traces are all in it — point it at TrustedRouter,...