Co-Failure Ceiling on Mixture-of-Agents Across 67 Frontier Models

Paper page - When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models

Comments

Josef Chen (josefchen), 2026-06-26:
Combining LLMs Rarely Beats the Single Best Model: A Provable Co-Failure Ceiling Across 67 Frontier Models

Aamer Mihaysi (O96a), 2026-06-28:
The focus on 'co-failure' rates in multi-model systems is a necessary reality check for the MoA and routing hype. We often obsess over pairwise correlation, but if the entire ensemble hits the same wall on a specific query, no amount of voting or cascading will save the output. This 'beta' metric provides a concrete ceiling for performance that's far more useful for production planning than average accuracy gains. It shifts the conversation from 'which model is better' to 'where do they all fail,' which is where the actual engineering work begins. I'm interested to see if this co-failure rate correlates with specific task types or if it's a general property of the training data overlap across frontier models.

Paper details

Paper ID: 2606.27288
Author: Josef Chen
Published: 2026-06-25
Submitted to Daily Papers: 2026-06-26, by josefchen

Abstract

Multi-model LLM systems such as routing, voting, cascades, fusion, and mixture-of-agents are used to beat single-model accuracy. We show that their gain is capped by a quantity the field rarely reports. For any policy whose output is one member model answer, accuracy cannot exceed one minus beta, where beta is the rate at which every model is wrong on the same query. In contrast, the usual diagnostic, average pairwise error correlation rho, cannot identify beta: error laws with identical marginals and pairwise correlations can have different all-wrong rates. A Clopper-Pearson bound on beta gives a finite-sample...
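
The two claims in the abstract can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (co_failure_rate, clopper_pearson, law_summary), the toy 5 x 3 correctness matrix, the two-sided 95% interval, and the three parity-based error laws. The abstract is truncated before it says how the Clopper-Pearson bound is used, so the interval here is only the standard exact binomial interval applied to the all-wrong count.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def co_failure_rate(correct):
    """Empirical beta: share of queries on which every model is wrong.

    correct[q][m] is True when model m answers query q correctly.
    """
    return sum(1 for row in correct if not any(row)) / len(correct)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_of_decreasing(f, iters=60):
    """Bisection on [0, 1] for a function that falls from positive to negative."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    half = alpha / 2
    lower = 0.0 if k == 0 else _root_of_decreasing(
        lambda p: half - (1 - binom_cdf(k - 1, n, p)))
    upper = 1.0 if k == n else _root_of_decreasing(
        lambda p: binom_cdf(k, n, p) - half)
    return lower, upper


def law_summary(error_vectors):
    """Marginals, pairwise joint error rates, and all-wrong rate of a
    uniform distribution over the given 0/1 error vectors (1 = wrong)."""
    n = len(error_vectors)
    m = len(error_vectors[0])
    marginals = [Fraction(sum(v[i] for v in error_vectors), n) for i in range(m)]
    pairwise = [Fraction(sum(v[i] * v[j] for v in error_vectors), n)
                for i, j in combinations(range(m), 2)]
    all_wrong = Fraction(sum(1 for v in error_vectors if all(v)), n)
    return marginals, pairwise, all_wrong


# Hypothetical correctness matrix: 5 queries x 3 models (illustrative data only).
correct = [
    [True, False, False],
    [False, False, False],
    [True, True, False],
    [False, True, True],
    [False, False, False],
]
beta_hat = co_failure_rate(correct)       # all three models fail on 2 of 5 queries
ceiling = 1 - beta_hat                    # cap for any policy that outputs a member answer
oracle_accuracy = sum(any(row) for row in correct) / len(correct)
beta_interval = clopper_pearson(k=2, n=5)  # interval for beta from 2 all-wrong out of 5

# Three error laws over three models with equal marginals and pairwise moments.
independent = list(product([0, 1], repeat=3))
even_parity = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
odd_parity = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]

for name, law in [("independent", independent),
                  ("even parity", even_parity),
                  ("odd parity", odd_parity)]:
    marginals, pairwise, all_wrong = law_summary(law)
    print(name, marginals, pairwise, all_wrong)
```

In the toy matrix, a perfect router that always picks a correct member when one exists reaches exactly 1 - beta_hat, which is why the ceiling cannot be improved by selection alone. In the three error laws, every model has error rate 1/2 and every pair has joint error rate 1/4 (so pairwise correlation is zero in all three), yet the all-wrong rate differs across them: 1/8, 0, and 1/4. That is one concrete instance of pairwise statistics failing to pin down beta.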

