Cache hit rates of Inference are more meaningful than the headline costs

GodelNumbering



Cache Hit Rates

TL;DR: Agents push the full conversation history into context on every turn, so over a large number of turns they are extremely read-heavy, which is why cache hit rates matter so much. This post analyzes 60+ providers and their cache hit rates using 398 data points. All data is sourced from openrouter.ai model pages.

Agentic workflows differ from most human-LLM conversations in one key characteristic: the average number of turns is far higher.

Context processing over a multi-turn conversation grows quadratically. Every turn passes the full conversation up to that point into context along with its own input, e.g. turn 100 pushes everything up to turn 99 into the context window again. The provider, on its end, tries to match the longest prefix it can against the caches it has available and processes the rest of the conversation as new input tokens, which are typically billed at around 10x the cached price. So on a long 200k-context conversation (which, by the way, is a bad idea capability-wise even if you don't care about costs), if a model that costs $5 per million input tokens fails to hit any cache, you'll be charged $1 for the input processing of that single turn. Two things determine how much you actually pay:

1. Cached input pricing - the headline metric everyone looks at.

2. Cache hit rate - the hidden variable that nobody talks about.
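To make the arithmetic above concrete, here is a minimal Python sketch. The function names are made up for illustration, the fixed tokens-per-turn figure is a simplification, and the blended-price formula (hits billed at the cached price, misses at the full price) is my assumption about how the two factors combine, not OpenRouter's documented method. The $0.50 cached price just applies the "typically 10x" ratio to the $5 example.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens over a conversation where turn k resends all k turns so far."""
    # Turn k sends k * tokens_per_turn tokens; summing k = 1..turns gives an
    # arithmetic series, which is where the quadratic growth comes from.
    return tokens_per_turn * turns * (turns + 1) // 2


def blended_input_price(uncached_price: float, cached_price: float, hit_rate: float) -> float:
    """Assumed effective $/M input tokens for a given cache hit rate (0.0 to 1.0)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_price + (1.0 - hit_rate) * uncached_price


def input_cost_usd(tokens: int, price_per_million: float) -> float:
    """Dollar cost of processing `tokens` input tokens at a $/M price."""
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    # The example from the text: one 200k-token turn at $5/M with zero cache hits.
    print(f"single uncached 200k turn: ${input_cost_usd(200_000, 5.0):.2f}")

    # Hypothetical session: 100 turns, each adding 2,000 tokens of history.
    total = cumulative_input_tokens(turns=100, tokens_per_turn=2_000)
    for hit_rate in (0.0, 0.37, 0.87):
        price = blended_input_price(uncached_price=5.0, cached_price=0.5, hit_rate=hit_rate)
        print(f"hit rate {hit_rate:.0%}: {total:,} input tokens cost ${input_cost_usd(total, price):.2f}")
```

The same token count gets several times cheaper or more expensive depending only on the hit rate, which is the whole point of looking past the headline cached price.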

I recently spent many hours convinced there was a bug in Dirac that was breaking caches, only to find in the end that the cause was entirely Gemini Flash 3's cache hit rate.

While looking for data on this, I found that OpenRouter fortunately publishes it (go to a model's page and look for the 'Effective Pricing' section). Since the data is hourly, we have to assume that it doesn't change much from hour to hour.

Provider Cache-hit Tier list

Providers with multiple endpoints (e.g. Amazon Bedrock US, Bedrock Global, Bedrock (1)) are listed separately — each entry reflects the hit rate of that specific endpoint as observed.

DeepSeek remains the gold standard of caching, which probably doesn't surprise anyone who has used their official API. In fact, all S-tier entries (hitting roughly 75% or higher cache hit rates) are Chinese labs: DeepSeek (87%), StepFun (86.1%), Moonshot AI (84.8%), MiniMax (75.4%), and Xiaomi (74.7%).

The mainstream US labs place somewhere in the middle, but as we will see in the next section, the variance is huge and rather interesting.

On the flip side, we have the "F-Tier". Providers like io.net, AkashML, SambaNova, and Nebius are clocking in at exactly 0.0% cache hit rates across the models.

US closed-source big 3

The most interesting thing to me in the chart above is that older models from the same provider tend to get lower cache hit rates. If I had to explain it non-cynically, I would guess that, from a systems engineering POV, it comes down to the cache-pool size allocated to each model.

Google does worse than the other two providers across the board, which is surprising considering that they own the full stack on TPUs. This gets full-on clowny when you look at the Vertex AI numbers (see table below) - Opus 4.7 on Vertex AI has a 65.30% cache hit rate while Google's own Google_Gemini_3.1_Pro_Preview has 37.30% (and this trend applies to all Claude vs Gemini models hosted on Vertex)! How do you manage to get a lower cache hit rate on your own hardware, with your own model trained on that hardware, than a competitor's model does? If I were to speculate, I'd guess that the whole 'thought signature' architecture is just not working out.

OSS Models Cost Comparison

For OSS models, the provider you choose makes a huge difference in what you pay, mostly due to cache-hit rates.

| Model | Cheapest effectiveInputPrice (Provider) | Most Expensive effectiveInputPrice (Provider) | Difference (Percentage) |
|---|---|---|---|
| Kimi K2.6 | 0.2760 (StreamLake) | 1.0900 (Phala) | 0.8140 (294.93%) |
| MiMo-V2.5-Pro | 0.3720 (Xiaomi) | 0.9060 (DeepInfra) | 0.5340 (143.55%) |
| DeepSeek V4 Pro (Max) | 0.0560 (DeepSeek) | 1.7220 (Parasail) | 1.6660 (2975.00%) |
| GLM-5.1 | 0.3230 (StreamLake) | 1.7470 (Venice) | 1.4240 (440.87%) |
| MiniMax-M2.7 | 0.1430 (MiniMax) | 0.6000 (SambaNova) | 0.4570 (319.58%) |
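For reference, the Difference column is the absolute gap between the two prices, and the percentage is that gap relative to the cheapest price. A small sketch that recomputes it from the table's own numbers (the helper name `price_spread` is hypothetical):

```python
def price_spread(cheapest: float, most_expensive: float) -> tuple[float, float]:
    """Return (absolute difference, difference as a percentage of the cheapest price)."""
    if cheapest <= 0:
        raise ValueError("cheapest price must be positive")
    diff = most_expensive - cheapest
    return diff, diff / cheapest * 100.0


# (cheapest, most expensive) effective input prices copied from the table above.
ROWS = {
    "Kimi K2.6": (0.2760, 1.0900),
    "MiMo-V2.5-Pro": (0.3720, 0.9060),
    "DeepSeek V4 Pro (Max)": (0.0560, 1.7220),
    "GLM-5.1": (0.3230, 1.7470),
    "MiniMax-M2.7": (0.1430, 0.6000),
}

if __name__ == "__main__":
    for model, (low, high) in ROWS.items():
        diff, pct = price_spread(low, high)
        print(f"{model}: {diff:.4f} ({pct:.2f}%)")
```

Note that a 2975% difference means the most expensive provider charges about 30.75x the cheapest one for the same model.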

Small Model Grift

Now onto smaller models, which instinctively seem cheaper. Below is the average effective pricing for 4 of the most popular local models (hi r/localLlama).

| Model Name | Total Providers | Avg Eff. Input | Avg Eff. Output | Avg Cache Hit |
|---|---|---|---|---|
| Google_Gemma_4_26B_A4B | 10 | $0.1156 | $0.4150 | 21.57% |
| Google_Gemma_4_31B | 10 | $0.1729 | $0.5049 | 7.44% |
| Qwen_Qwen36_35B_A3B | | $0.1643 | $1.1450 | 11.54% |
| Qwen_Qwen36_27B | | $0.4096 | $2.9433 | 7.66% |

Compare that to:

| Model Name (Official API) | Eff. Input Price | Eff. Output Price | Cache Hit Rate |
|---|---|---|---|
| DeepSeek_DeepSeek_V4_Pro | $0.0560 | $0.8690 | 87.90% |
| DeepSeek_DeepSeek_V4_Flash | $0.0220 | $0.2800 | 86.10% |

Yup, you can use DeepSeek V4 Pro, a 1.6 trillion parameter model whose 49B active parameters exceed the total parameter count of any of the small models above, for a lower effective input price than any of them.
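Since DeepSeek V4 Pro's effective output price in the tables above is higher than Gemma's, whether it wins overall depends on how read-heavy the workload is. Here is a rough sketch of that break-even, using only the effective prices from the two tables. The function names are hypothetical, and it assumes those averaged effective prices would actually apply to your own traffic, which they may not.

```python
def session_cost_usd(input_tokens: int, output_tokens: int,
                     eff_input_price: float, eff_output_price: float) -> float:
    """Session cost given effective $/M prices (cache hits already blended in)."""
    return (input_tokens * eff_input_price + output_tokens * eff_output_price) / 1_000_000


def break_even_input_output_ratio(a_in: float, a_out: float,
                                  b_in: float, b_out: float) -> float:
    """Input:output token ratio above which model A is cheaper than model B.

    Only meaningful when A has the cheaper input and the pricier output.
    """
    if a_in >= b_in or a_out <= b_out:
        raise ValueError("expected A to have cheaper input and pricier output than B")
    # Solve in * a_in + out * a_out < in * b_in + out * b_out for in / out.
    return (a_out - b_out) / (b_in - a_in)


# Effective prices taken from the tables above.
DEEPSEEK_V4_PRO = (0.0560, 0.8690)
GEMMA_4_26B_A4B = (0.1156, 0.4150)

if __name__ == "__main__":
    ratio = break_even_input_output_ratio(*DEEPSEEK_V4_PRO, *GEMMA_4_26B_A4B)
    print(f"DeepSeek V4 Pro is cheaper once input:output exceeds about {ratio:.1f}:1")

    # Hypothetical read-heavy agent session: 10M input tokens, 100k output tokens.
    for name, (p_in, p_out) in (("DeepSeek V4 Pro", DEEPSEEK_V4_PRO),
                                ("Gemma 4 26B A4B", GEMMA_4_26B_A4B)):
        print(f"{name}: ${session_cost_usd(10_000_000, 100_000, p_in, p_out):.4f}")
```

Long agentic sessions, where each turn resends the whole history, tend to sit far above that ratio, so the input price dominates the bill.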
