Quantifying LLM Cost Savings from Cache-Aware Inference Routing

A Statistical Report

The same LLM session, even with the same model, parameters, prompts, and responses, can incur substantially different costs depending on the inference provider that serves it. The gap comes from token pricing, provider prompt-caching behavior, and users’ request patterns. When those factors diverge, a cache-aware cost-arbitrage layer can reduce spend. The question is how much it saves and how reliably.

We benchmarked Auriko's cache-aware cost-arbitrage engine against five inference providers and an inference routing peer. The benchmark sent over 80,000 API requests to 37 models across 3 workload types, forming over 22,000 sessions. Auriko reduced dollar-weighted spend against all six tested baselines: by 32.8% against the routing peer (95% CI [30.6%, 34.9%]) and by 7.7–38.3% across the five single-provider baselines.

This report presents the benchmark results, explains the cost drivers behind the savings, and documents the methodology, statistical tests, and robustness checks.

The Opportunity: Cost Dispersion Across Providers

Inference cost for the same requested model varies across providers. For some models we tested, the most expensive path costs 4x the cheapest for an identical request.

We compared 37 models across providers. For each model, we divided each provider's average cost by the lowest-cost provider's average cost and multiplied by 100. A score of 100 means cheapest. A score of 300 means three times the cost for the same request.
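The index described above can be sketched in a few lines of Python. This is a minimal illustration rather than the report's actual pipeline: the function name `cost_index`, the input layout, and the dollar figures are hypothetical.

```python
def cost_index(avg_cost):
    """Convert average costs into a cheapest-equals-100 index.

    avg_cost: {model: {provider: average cost in dollars}} (assumed layout).
    Returns {model: {provider: index}}.
    """
    index = {}
    for model, by_provider in avg_cost.items():
        cheapest = min(by_provider.values())
        if cheapest <= 0:
            raise ValueError(f"non-positive cost for {model}")
        # Divide each provider's cost by the cheapest and scale by 100
        index[model] = {p: 100.0 * c / cheapest for p, c in by_provider.items()}
    return index


# Hypothetical average costs, for illustration only
example = {"model-a": {"provider-x": 0.5, "provider-y": 1.5, "provider-z": 2.0}}
scores = cost_index(example)
print(scores["model-a"])
```

In this made-up example, `provider-y` scores 300, meaning three times the cost of the cheapest provider for the same request.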

The variation appears across the models in the study, and the pattern holds across workload types. Coding agent workloads show the widest gap; single-turn and conversation workloads show comparable spreads (details in Robustness and Sensitivity).

This cost dispersion means that within the tested provider set, a cheapest provider choice exists for each measured workload. The rest of this report presents a statistical study of the cost reduction observed for Auriko's cache-aware routing approach. The next section describes how we conducted the study.

Experimental Design

This study adopts a matched-pairs design. For each model and workload combination, every call to Auriko and to each comparator uses identical prompts. Request parameters (temperature, top-p, frequency penalty, presence penalty, and maximum output length) are held constant across all calls. Both sides of a pair are dispatched concurrently to control for time-of-day effects. Each combination is repeated multiple times. If either side of a pair errors, both sides are excluded before analysis.
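The symmetric exclusion rule in the last sentence is simple to express in code. The sketch below assumes a hypothetical `SessionResult` record with an `error` flag; the report does not describe its actual data structures.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SessionResult:
    """One side of a matched pair (hypothetical record layout)."""
    cost: Optional[float]  # dollars; None when the session errored
    error: bool = False


def clean_pairs(
    pairs: List[Tuple[SessionResult, SessionResult]]
) -> List[Tuple[SessionResult, SessionResult]]:
    """Drop a pair entirely if either side errored (symmetric exclusion)."""
    return [(a, b) for a, b in pairs if not a.error and not b.error]


# Illustrative data: the second pair is dropped because one side errored
pairs = [
    (SessionResult(0.10), SessionResult(0.14)),
    (SessionResult(None, error=True), SessionResult(0.20)),
    (SessionResult(0.05), SessionResult(0.05)),
]
kept = clean_pairs(pairs)
print(len(kept))
```

Dropping both sides together keeps the comparison paired: neither side retains sessions that the other side lost.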

Coverage

| Dimension | Coverage |
| --- | --- |
| Comparators | 1 routing peer + 5 inference providers |
| Models | 37 models |
| Workloads | Single-turn, multi-turn conversation, and multi-step coding agent sessions |
| Requests made | 80,634 API requests |
| LLM sessions | 22,416 sessions |
| Requested paired sessions | 10,752 |
| Clean paired target comparisons | 9,594 |
| Benchmark window | 2026-05-28 to 2026-06-07 UTC |

Requests made is the total number of individual API calls. LLM sessions group related calls into workload sessions, including multi-turn conversations and multi-step agent runs. Requested paired sessions are target-level session pairings before error exclusion; clean paired target comparisons are the same pairings after symmetric exclusion when either side errored.

Metric

Cost is the dollar amount returned by each API response. For session-level analysis, request costs are summed within each workload session. For the routing peer, cost is adjusted for their platform fee, matching what a user of that service effectively pays:

adjusted cost = API-returned cost × (1 + platform fee %)

Since Auriko charges no platform fees, no cost adjustment is needed.
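As a small sketch of the metric, the two steps (summing request costs within a session, then applying the platform fee where one exists) might look like the following. The function names and the 5% fee are placeholders; the report does not state the routing peer's actual fee.

```python
import math


def session_cost(request_costs):
    """Sum the per-request dollar costs returned by the API for one session."""
    return math.fsum(request_costs)


def adjusted_cost(api_cost, platform_fee_pct):
    """Apply: adjusted cost = API-returned cost x (1 + platform fee %)."""
    return api_cost * (1 + platform_fee_pct / 100.0)


# Hypothetical three-request session and a placeholder 5% platform fee
raw = session_cost([0.01, 0.02, 0.03])
peer_effective = adjusted_cost(raw, platform_fee_pct=5.0)
# With no platform fee, the adjustment leaves the cost unchanged
no_fee = adjusted_cost(raw, platform_fee_pct=0.0)
```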

Statistical Methods

| Method | Purpose |
| --- | --- |
| Stratified bootstrap (95% CI, 5,000 replicates, model + scenario strata) | Confidence intervals |
| Sign test (one-sided) | Directional evidence at the session level |
| Wilcoxon signed-rank test (one-sided) | Paired magnitude test |
| Holm–Bonferroni correction | Multiple-comparison control on model-level tests |
| Effect size (Hedges' g) | Standardized magnitude measure |
| Win rate inference (binomial + Wilson CI) | Session-level directional evidence |
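To make several of these methods concrete, here is a standard-library sketch of a stratified percentile bootstrap, a one-sided sign test, a Wilson interval, and the Holm step-down adjustment. It is not the report's code. In particular, the definition of dollar-weighted savings as one minus total router spend over total baseline spend is an assumption, as are the function names and the record layout.

```python
import math
import random
from collections import defaultdict


def dollar_weighted_savings(pairs):
    """pairs: list of (router_cost, baseline_cost).
    Assumed definition: 1 - total router spend / total baseline spend."""
    router_total = math.fsum(r for r, _ in pairs)
    baseline_total = math.fsum(b for _, b in pairs)
    return 1.0 - router_total / baseline_total


def stratified_bootstrap_ci(records, n_boot=5000, alpha=0.05, seed=0):
    """records: list of (stratum, router_cost, baseline_cost), where stratum
    could be a (model, scenario) tuple. Returns a percentile interval."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for stratum, router_cost, baseline_cost in records:
        strata[stratum].append((router_cost, baseline_cost))
    stats = []
    for _ in range(n_boot):
        resample = []
        for pairs in strata.values():
            # Resample with replacement inside each stratum, keeping its size
            resample.extend(rng.choices(pairs, k=len(pairs)))
        stats.append(dollar_weighted_savings(resample))
    stats.sort()
    lo_idx = max(0, math.floor(alpha / 2 * n_boot))
    hi_idx = min(n_boot - 1, math.ceil((1 - alpha / 2) * n_boot) - 1)
    return stats[lo_idx], stats[hi_idx]


def sign_test_p(wins, n):
    """One-sided sign test: P(X >= wins) for X ~ Binomial(n, 0.5).
    n counts non-tied pairs only."""
    return sum(math.comb(n, i) for i in range(wins, n + 1)) / 2 ** n


def wilson_interval(wins, n, z=1.959963984540054):
    """Wilson score interval for a win rate (z defaults to the 95% level)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p-value is multiplied by m, the next by m - 1, and so on
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted
```

Resampling within strata keeps each model and scenario represented at its original size in every replicate, so the interval reflects session-level variation rather than shifts in workload mix. The Wilcoxon signed-rank test and Hedges' g are omitted here for brevity.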

Validity Controls

| Control | What it checks |
| --- | --- |
| Symmetric exclusion | Sessions with errors on either side are removed from both sides, preventing one-sided inflation |
| Output-volume normalization | Output costs scaled to match between sides |
| Output-volume filtering | Restricted to sessions where both sides produced similar output (±10%) |
| Error-inclusion test | Re-including error sessions to test whether exclusion materially changes the result |
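The two output-volume controls can be sketched as follows. The report does not say how "similar output" is measured or how output costs are scaled, so both conventions here are assumptions: the ±10% tolerance is taken relative to the larger output token count, and normalization rescales one side's output cost linearly to the other side's token count.

```python
def similar_output(tokens_a, tokens_b, tolerance=0.10):
    """True when the two output token counts differ by at most `tolerance`,
    measured relative to the larger count (assumed convention)."""
    larger = max(tokens_a, tokens_b)
    if larger == 0:
        return True  # both sides produced no output
    return abs(tokens_a - tokens_b) / larger <= tolerance


def normalize_output_cost(output_cost, own_tokens, reference_tokens):
    """Rescale an output cost to the reference side's output volume,
    assuming cost is linear in output tokens (an assumption)."""
    if own_tokens == 0:
        return 0.0
    return output_cost * reference_tokens / own_tokens


# Illustrative checks with made-up token counts
print(similar_output(100, 95))   # within the tolerance
print(similar_output(100, 80))   # outside the tolerance
print(normalize_output_cost(0.5, own_tokens=100, reference_tokens=50))
```

Both controls address the same concern: a side that happens to generate shorter responses would otherwise look cheaper for reasons unrelated to routing.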

The Evidence: Cost Reduction

The dispersion section showed that the same requested model can cost materially different amounts on different...
