Benchmark accuracy retention is the wrong metric

Benchmark retention is not utility retention © 2026

I want to flesh out the point I made here about benchmark accuracy being a bad metric for evaluating model routing. Users of model routing care about utility retention, not accuracy retention. Let’s model the problem as utility retention rather than accuracy retention.

Routing benchmarks typically report a metric of the form

$$\text{Performance Retention} = \frac{\text{Router Accuracy}}{\text{Best Model Accuracy}}.$$

For example, if the strongest model solves 100 benchmark tasks and a router solves 99 of them, the router is said to achieve 99% of the best model’s performance. This implicitly assumes that every task carries equal value. Formally, if the benchmark contains tasks $t_1,\ldots,t_n$, benchmark accuracy is

$$\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{\text{task } t_i \text{ solved correctly}\}.$$

This is equivalent to assigning every task a value of one.

In deployment, however, tasks have heterogeneous importance. Let $V(t)$ denote the value of solving task $t$, $L(t)$ the loss incurred by failing task $t$, and $C(m,t)$ the inference cost of running model $m$ on task $t$. Then the relevant objective is not accuracy but expected utility:

$$U(m) = \mathbb{E}_{t \sim D}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}} - C(m,t)\right],$$

where $D$ is the real-world distribution of tasks.
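To make the gap between accuracy and expected utility concrete, here is a minimal sketch that computes both on the same set of outcomes. The `Outcome` record, the function names, and all the numbers are hypothetical, chosen only for illustration; the sample mean stands in for the expectation over $D$.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    value: float   # V(t): value of solving the task
    loss: float    # L(t): loss incurred by failing the task
    cost: float    # C(m, t): inference cost of the model that ran
    correct: bool  # whether the model solved the task


def accuracy(outcomes):
    """Fraction of tasks solved: every task implicitly has value one."""
    return sum(o.correct for o in outcomes) / len(outcomes)


def mean_utility(outcomes):
    """Sample estimate of E[V*1_correct - L*1_incorrect - C]."""
    total = 0.0
    for o in outcomes:
        total += (o.value if o.correct else -o.loss) - o.cost
    return total / len(outcomes)


# Illustrative numbers only: three cheap tasks solved, one costly task failed.
outcomes = [
    Outcome(value=1.0, loss=0.0, cost=0.25, correct=True),
    Outcome(value=1.0, loss=0.0, cost=0.25, correct=True),
    Outcome(value=1.0, loss=0.0, cost=0.25, correct=True),
    Outcome(value=10.0, loss=20.0, cost=0.25, correct=False),
]

print(accuracy(outcomes))      # 0.75
print(mean_utility(outcomes))  # -4.5
```

With these made-up numbers the model is 75% accurate yet has negative average utility, because the single failure falls on the task with the largest loss.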
The routing problem is therefore

$$\max_{R}\;\mathbb{E}_{t \sim D}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}} - C(R(t),t)\right],$$

where $R(t)$ is the model selected by the router. Benchmark retention estimates

$$\mathbb{E}[\mathbf{1}_{\text{correct}}],$$

while deployment performance depends on

$$\mathbb{E}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}}\right].$$

These coincide only in the special case where all tasks have identical value and identical failure costs. In practice, task importance is often heavy-tailed. A router that achieves 99% benchmark retention may retain substantially less than 99% of deployment utility if the omitted 1% of tasks contains a disproportionate share of real-world value.

There is no monotonic relationship between benchmark accuracy retention and utility retention. Consider a benchmark of $n$ tasks. Suppose task $t_1$ carries value $M$, while each remaining task carries value $1$. The strongest model solves all tasks, while the router fails only on $t_1$. Then benchmark retention is

$$\frac{n-1}{n},$$

which approaches $1$ as $n \to \infty$. However, measuring utility by value alone (setting failure losses and inference costs aside), utility retention is

$$\frac{n-1}{M+n-1},$$

which approaches $0$ as $M \to \infty$ for fixed $n$, and more generally whenever $M$ grows faster than $n$. Thus a router can achieve arbitrarily high benchmark retention while retaining arbitrarily little real-world utility.

It’s true that this example assumes the router misses the most important task. But my claim is only that benchmark accuracy retention is decoupled from deployment utility retention. That claim does not require the router to miss the important task, only that it could.
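The counterexample can be checked numerically. The sketch below is illustrative only: the helper `retentions` and the chosen values of $n$ and $M$ are assumptions, the best model is assumed to solve every task, and utility is measured by value alone, as in the example.

```python
from fractions import Fraction


def retentions(values, router_solved):
    """Return (benchmark retention, utility retention) for a router.

    values[i] is V(t_i); router_solved[i] says whether the router solved t_i.
    Assumes the best model solves every task, and ignores failure losses
    and inference costs, so utility is just the total value solved.
    """
    n = len(values)
    bench = Fraction(sum(router_solved), n)
    kept = sum(v for v, ok in zip(values, router_solved) if ok)
    util = Fraction(kept, sum(values))
    return bench, util


n = 100
for M in (1, 100, 10_000):
    values = [M] + [1] * (n - 1)         # t_1 carries value M, the rest value 1
    solved = [False] + [True] * (n - 1)  # the router fails only on t_1
    bench, util = retentions(values, solved)
    print(f"M={M}: benchmark retention {bench}, utility retention {util}")
```

Benchmark retention stays at 99/100 in every row, while utility retention falls from 99/100 to 99/199 to 99/10099 as $M$ grows, matching $\frac{n-1}{M+n-1}$.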

