Surpassing Frontier Performance with Fusion

OpenRouter Blog · Brian Thomas · 6/12/2026

We’ve found that synthesizing the results of multiple models can significantly outperform what individual models are capable of. Introducing Fusion: a tool for getting these combined results just as easily as calling a single model. It allows you to choose a panel of participant models alongside a judge model responsible for fusing the individual results together.

To understand the benefits of Fusion, we used a deep research benchmark that tests the combination of reasoning, tool usage, and knowledge. We found that:

Panels consistently outperform individual models

Beyond-frontier performance can be achieved with frontier panels

Panels of budget models can surpass frontier models and get close to frontier panel performance

Try Fusion now in a chatroom, or check out the API docs to build it into your application.

Panels of Models Consistently Outperform on Deep Research

We tested Fusion on 100 deep research tasks from the DRACO benchmark. Some highlights of what we found:

Fable 5 + GPT-5.5 fused together scored 69.0%**, surpassing every individual model, including Fable 5 alone at 65.3%**.

A budget panel (Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro) beat GPT-5.5 and Opus 4.8. It came within one percentage point of Fable 5’s score (64.7% vs. 65.3%) at 50% of the cost.

| Type | Model(s) | Score |
| --- | --- | --- |
| Fusion | Fable 5 + GPT-5.5** synthesized by Opus 4.8 | 69.0% |
| Fusion | Opus 4.8 + GPT-5.5 + Gemini 3.1 Pro synthesized by Opus 4.8 | 68.3% |
| Fusion | Opus 4.8 + GPT-5.5 synthesized by Opus 4.8 | 67.6% |
| Fusion | Opus 4.8 + Opus 4.8 synthesized by Opus 4.8 | 65.5% |
| Solo | Claude Fable 5** | 65.3% |
| Fusion | Gemini 3 Flash + Kimi K2.6 + DeepSeek V4 Pro synthesized by Opus 4.8 | 64.7% |
| Solo | DeepSeek V4 Pro | 60.3% |
| Solo | GPT-5.5 | 60.0% |
| Solo | Claude Opus 4.8 | 58.8% |
| Solo | Kimi K2.6 | 53.7% |
| Solo | Gemini 3.1 Pro | 45.4% |
| Solo | Gemini 3 Flash | 43.1% |

** 7 of the 100 DRACO tasks were not completed because Fable 5’s content filters blocked them from executing. We chose not to fall back to Opus 4.8 for those tasks, so the Fable results reflect 93 scored tasks rather than the full 100. This gives the most accurate picture of Fable’s own performance, but means direct score comparisons against models that completed all 100 tasks are slightly uneven.

We believe this demonstrates the benefits of model diversity, similar to the benefits seen in human teams. Bringing multiple perspectives to a complex problem yields better results.

One API call that fuses the best output of multiple models

When you send a prompt to Fusion, we dispatch it to a panel of models in parallel, each with web search and web fetch enabled. A judge model reads every panel response and produces a structured analysis: consensus points, contradictions, partial coverage, unique insights, and blind spots. The calling model then writes the final answer grounded in that analysis.
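To make the three stages concrete, here is a minimal Python sketch of the same flow: fan out to the panel, have a judge analyze the responses, then write a final answer. It is an illustration only, not OpenRouter's server-side implementation. The names `fuse`, `call_model`, and `writer` are hypothetical, the prompt wording is invented, and passing the final writer as its own parameter is an assumption about which model the post means by "the calling model". Web search and web fetch are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical hook: any function that sends `prompt` to `model_slug`
# and returns the text of the reply.
ModelCall = Callable[[str, str], str]


def fuse(prompt: str, panel: List[str], judge: str, writer: str,
         call_model: ModelCall) -> str:
    """Fan out to a panel, analyze with a judge, then write a final answer."""
    if not panel:
        raise ValueError("panel must contain at least one model")

    # Stage 1: send the same prompt to every panel model in parallel.
    # pool.map returns results in the same order as `panel`.
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        responses = list(pool.map(lambda slug: call_model(slug, prompt), panel))

    # Stage 2: the judge reads every panel response and produces an analysis.
    labeled = "\n\n".join(
        f"[{slug}]\n{text}" for slug, text in zip(panel, responses)
    )
    analysis = call_model(
        judge,
        f"Question:\n{prompt}\n\nPanel responses:\n{labeled}\n\n"
        "List consensus points, contradictions, partial coverage, "
        "unique insights, and blind spots.",
    )

    # Stage 3: the final answer is written from the analysis
    # (assumed here to be a separate call to `writer`).
    return call_model(
        writer,
        f"Question:\n{prompt}\n\nAnalysis:\n{analysis}\n\n"
        "Write the final answer grounded in this analysis.",
    )
```

Passing `call_model` in as a function keeps the sketch independent of any particular client library, so it can be exercised with a stub before wiring in real model calls.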

The whole pipeline runs server-side, so you can call it just as you would an individual model.
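From client code, that means a Fusion request can be assembled like any other chat request. The Python sketch below builds request bodies using the field names from the JSON examples in this section (`plugins`, `id`, `model`, `analysis_models`). Several things here are assumptions rather than statements from this post: the helper names `build_fusion_request` and `send` are hypothetical, the endpoint URL is OpenRouter's chat completions endpoint as we understand it, and reading the plugin's `model` as the judge and `analysis_models` as the panel is an interpretation of the example. Check the API docs before relying on any of it.

```python
import json
import urllib.request
from typing import List, Optional

# Assumption: OpenRouter's OpenAI-compatible chat completions endpoint.
API_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_fusion_request(prompt: str,
                         judge: Optional[str] = None,
                         panel: Optional[List[str]] = None) -> dict:
    """Build a request body for the Fusion model slug.

    With no judge or panel, the body uses the default Fusion setup.
    """
    body: dict = {
        "model": "openrouter/fusion",
        "messages": [{"role": "user", "content": prompt}],
    }
    if judge or panel:
        plugin: dict = {"id": "fusion"}
        if judge:
            plugin["model"] = judge  # assumed to select the judge model
        if panel:
            plugin["analysis_models"] = list(panel)  # assumed to be the panel
        body["plugins"] = [plugin]
    return body


def send(body: dict, api_key: str) -> dict:
    """POST the body and return the parsed JSON response (needs network access)."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping body construction separate from the network call makes the payload easy to inspect or log before anything is sent.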

Call Fusion directly with a single model slug:

    {
      "model": "openrouter/fusion",
      "messages": [
        { "role": "user", "content": "What are the strongest arguments for and against carbon taxes?" }
      ]
    }

Or customize the panel:

    {
      "model": "openrouter/fusion",
      "messages": [{ "role": "user", "content": "..." }],
      "plugins": [{
        "id": "fusion",
        "model": "google/gemini-3-flash-preview",
        "analysis_models": [
          "google/gemini-3-flash-preview",
          "moonshotai/kimi-k2.6",
          "deepseek/deepseek-v4-pro"
        ]
      }]
    }

We chose DRACO to test reasoning, tool calling, and succinctness

We needed a benchmark that could tell the difference between a model that sounds thorough and one that actually is. Standard benchmarks test factual recall or reasoning puzzles. They don’t test the thing Fusion is built for: researching a complex question, synthesizing multiple sources, and producing a comprehensive, well-cited analysis.

DRACO (by Perplexity AI) is designed for this. It contains 100 deep research tasks spanning 10 domains: academic research, finance, law, medicine, technology, UX design, general knowledge, needle-in-a-haystack retrieval, personalized assistance, and product comparison.

Each task comes with a rubric of roughly 39 weighted criteria across four categories:

Factual Accuracy (~20 criteria): verifiable claims the response must get right

Breadth & Depth (~9 criteria): synthesis quality, trade-off analysis, actionable guidance

Presentation Quality (~6 criteria): terminology, formatting, readability

Citation Quality (~5 criteria): primary source citations with working references
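As a rough illustration of how a weighted rubric like this can be turned into a single score, the Python sketch below sums the weights of the criteria a response meets and divides by the total positive weight. This normalization is an assumption chosen for the example, not DRACO's documented formula, and the `Criterion` class, `rubric_score` function, and all weights are hypothetical. Weights are allowed to be negative, so meeting a negatively weighted criterion lowers the score.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Criterion:
    category: str   # e.g. "Factual Accuracy"
    weight: float   # may be negative
    met: bool       # whether the grader judged the criterion as met


def rubric_score(criteria: Iterable[Criterion]) -> float:
    """Return a score in [0, 1] under the assumed normalization.

    earned     = sum of weights for criteria that were met (any sign)
    max_points = sum of all positive weights
    """
    items = list(criteria)
    max_points = sum(c.weight for c in items if c.weight > 0)
    if max_points <= 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.weight for c in items if c.met)
    # Clamp so that heavy penalties cannot push the score below zero.
    return min(1.0, max(0.0, earned / max_points))


example = [
    Criterion("Factual Accuracy", 4, met=True),
    Criterion("Breadth & Depth", 3, met=False),
    Criterion("Presentation Quality", 1, met=True),
    Criterion("Citation Quality", 2, met=True),
    Criterion("Factual Accuracy", -2, met=True),  # penalty criterion that was triggered
]
print(rubric_score(example))  # (4 + 1 + 2 - 2) / (4 + 3 + 1 + 2) = 0.5
```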

Criteria can carry negative weights. Meeting a negative criterion means the...
