Benchmark retention is not utility retention<br>© 2026<br>I want to flesh out the point I made here about benchmark accuracy being a bad metric for evaluating model routing.<br>Users of model routing care about utility retention, not accuracy retention.<br>Let’s model the problem as utility retention rather than accuracy retention.<br>Routing benchmarks typically report a metric of the form:<br>$$\text{Performance Retention} = \frac{\text{Router Accuracy}}{\text{Best Model Accuracy}}.$$<br>For example, if the strongest model solves 100 benchmark tasks and a router solves 99 of them, the router is said to achieve 99% of the best model’s performance.<br>This implicitly assumes that every task carries equal value. Formally, if the benchmark contains tasks $t_1,\ldots,t_n$, benchmark accuracy is<br>$$\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{\text{task } t_i \text{ solved correctly}\}.$$<br>This is equivalent to assigning every task a value of one.<br>In deployment, however, tasks have heterogeneous importance. 
Let<br>$V(t)$ denote the value of solving task $t$,<br>$L(t)$ denote the loss incurred by failing task $t$,<br>$C(m,t)$ denote the inference cost of running model $m$ on task $t$.<br>Then the relevant objective is not accuracy but expected utility:<br>$$U(m) = \mathbb{E}_{t \sim D}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}} - C(m,t)\right],$$<br>where $\mathbf{1}_{\text{correct}}$ and $\mathbf{1}_{\text{incorrect}}$ indicate whether the model solves or fails task $t$, and $D$ is the real-world distribution of tasks.<br>The routing problem is therefore<br>$$\max_{R}\;\mathbb{E}_{t \sim D}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}} - C(R(t),t)\right],$$<br>where $R(t)$ is the model selected by the router.<br>Benchmark retention is built from estimates of<br>$$\mathbb{E}[\mathbf{1}_{\text{correct}}],$$<br>while deployment performance depends on<br>$$\mathbb{E}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}}\right].$$<br>These coincide only in the special case where all tasks have identical value and identical failure costs.<br>In practice, task importance is often heavy-tailed. A router that achieves 99% benchmark retention may retain substantially less than 99% of deployment utility if the 1% of tasks it fails contains a disproportionate share of real-world value.<br>There is no monotonic relationship between benchmark accuracy retention and utility retention.<br>Consider a benchmark of $n$ tasks. Suppose task $t_1$ carries value $M$, while each remaining task carries value $1$. 
The strongest model solves all tasks, while the router fails only on $t_1$.<br>Then benchmark retention is<br>$$\frac{n-1}{n},$$<br>which approaches $1$ as $n \to \infty$.<br>However, setting failure losses and inference costs aside ($L = 0$, $C = 0$), utility retention is<br>$$\frac{n-1}{M+n-1},$$<br>which approaches $0$ as $M \to \infty$ for fixed $n$, and more generally whenever $M$ grows faster than $n$. With $M = n^2$, for instance, benchmark retention tends to $1$ while utility retention tends to $0$.<br>Thus a router can achieve arbitrarily high benchmark retention while retaining arbitrarily little real-world utility.<br>Admittedly, the example assumes the router misses the most important task. But my claim is only that benchmark accuracy retention is decoupled from deployment utility retention. That claim does not require the router to miss the important task, only that it could.
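To make the two limits concrete, here is a minimal Python sketch of the toy example. It assumes failure losses and inference costs are zero, so utility is just the total value of solved tasks. The function names (`benchmark_retention`, `utility_retention`, `toy_example`) and the choice $M = n^2$ are mine, for illustration only.

```python
from fractions import Fraction


def benchmark_retention(router_solved, best_solved):
    """Ratio of solved-task counts: every task is implicitly worth one."""
    return Fraction(sum(router_solved), sum(best_solved))


def utility_retention(values, router_solved, best_solved):
    """Ratio of value captured, assuming L(t) = 0 and C(m, t) = 0."""
    router_utility = sum(v for v, ok in zip(values, router_solved) if ok)
    best_utility = sum(v for v, ok in zip(values, best_solved) if ok)
    return Fraction(router_utility, best_utility)


def toy_example(n, M):
    """Task t_1 is worth M, the other n - 1 tasks are worth 1 each.

    The best model solves everything; the router fails only on t_1.
    """
    values = [M] + [1] * (n - 1)
    best_solved = [True] * n
    router_solved = [False] + [True] * (n - 1)
    return (
        benchmark_retention(router_solved, best_solved),
        utility_retention(values, router_solved, best_solved),
    )


if __name__ == "__main__":
    # Illustrative assumption: let M grow as n^2 so both limits show up together.
    for n in (10, 100, 1000):
        bench, util = toy_example(n, n * n)
        print(f"n={n:5d}  benchmark={float(bench):.4f}  utility={float(util):.4f}")
```

As `n` grows, the benchmark column climbs toward 1 while the utility column falls toward 0, matching $\frac{n-1}{n}$ and $\frac{n-1}{M+n-1}$.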