GKE Inference Gateway prefix caching accelerates AI inference | Google Cloud Blog<br>Contact sales Get started for free
Containers & Kubernetes
Report: GKE Inference Gateway delivers up to 92% faster AI responses
June 9, 2026
Bob Tian<br>Software Engineer
Susan Wu<br>Outbound Product Manager
Try Gemini Enterprise Business Edition today<br>The front door to AI in the workplace<br>Try now
As generative AI moves from experimental pilots to massive production environments, the efficiency of your infrastructure becomes the ultimate differentiator. One way to get the most out of it and minimize costly accelerator idle time is to leverage the Google Kubernetes Engine (GKE) Inference Gateway, which intelligently routes generative AI workloads based on real-time model server metrics.
Instead of relying on traditional, naive round-robin load balancing — which frequently triggers expensive accelerator recomputation and spikes user latency — this native extension of the GKE Gateway utilizes advanced capabilities like prefix caching and model-aware routing. By ensuring requests land on the exact accelerator that is primed to process them right away, GKE transforms how you can serve your large language models (LLMs), with excellent hardware utilization and ultra-fast response times.
In fact, according to an independent benchmark report, GKE Inference Gateway outperforms the next leading managed Kubernetes service with 15.7% higher throughput, 92.8% shorter wait times, and 62.6% lower inter-token latency . This performance takes LLM-based applications from sluggish and expensive to fast and production-grade.
That performance tracks with Snap’s experience using GKE Inference Gateway.
“At Snap, we are integrating llm-d into our production AI infrastructure to facilitate high-performance inference at scale. By employing prefix-cache-aware routing, we have achieved prefix cache hit rates ranging up to 75-80%. We appreciate the open-source nature of llm-d, as it enables seamless integration with our Envoy-based Service Mesh.” - Vinay Kola, Senior Manager, Software Engineering, Snap Inc.
In this blog, we take a closer look at GKE Inference Gateway’s prefix caching, complete with examples. We also provide more details about its benchmark results. Let’s jump in.
The secret to low-latency AI: Prefix caching
Prefix caching optimizes LLM performance by storing the KV cache (activation states) of long, repetitive prompt prefixes. When consecutive user requests share the same system instructions, context, or documentation, the model entirely skips reprocessing those tokens. GKE Inference Gateway reads incoming request prefixes and matches them to the specific pods that already hold that data in memory. This eliminates the "thinking" tax on your GPUs and TPUs, turning heavy reasoning loops into near-instant answers.
Use case 1: Documentation and codebase Q&A with retrieval-augmented generation (RAG)
When querying massive enterprise repositories, you can ground your LLMs’ responses without any added latency by pinning entire documentation sets as static cached prefixes, using RAG.
Instead of forcing an LLM to re-read thousands of lines of API references or corporate wikis for every single user question, GKE Inference Gateway routes the query to a pod that already has that specific context warmed up in its KV cache. The LLM only has to compute the user's brief, dynamic question, completely bypassing expensive document re-evaluation.
[10,000+ words of API reference documentation, endpoints, error codes, etc.]
[DYNAMIC SUFFIX - CHANGES PER REQUEST] User Question: How do I handle a 429 rate limit error using the Python SDK?" soy-skip ssk='4:dXrt'><br>Loading...
[STATIC PREFIX - STAYS IN CACHE] You are an expert AI assistant specializing in technical documentation. Below is the complete API documentation for our software platform. Use this context to answer the user's questions accurately. If the answer cannot be found in the documentation, say "I cannot find that in the provided context."
[10,000+ words of API reference documentation, endpoints, error codes, etc.]
[DYNAMIC SUFFIX - CHANGES PER REQUEST] User Question: How do I handle a 429 rate limit error using the Python SDK?
Figure 1: Prompt breakdown for a software troubleshooting scenario showing cached static prefix and dynamic suffix that changes per request
Use case 2: Multi-turn chat
You can also use prefix caching to maintain customer service interactions across thousands of simultaneous sessions without compounding compute costs. You can do so by caching permanent system personas and core business rules directly on the LLM server.
In enterprise chat architectures, the base system prompt and reference tables remain completely identical across millions of customer interactions. GKE Inference Gateway handles these multi-turn conversations using context-aware routing to bypass repetitive token processing, so that your chatbot stays ultra-responsive even under peak...