Improving token efficiency in GitHub Copilot
June 17, 2026 by Ryan Caldwell and Bhavya U
With the recent move to usage-based billing for GitHub Copilot, every token in an agentic session matters. Tokens affect your credits, your latency, and the context window an agent has left to finish the task. In our own data, each new model generation tends to consume more tokens per task than the last, so harness-level efficiency is increasingly important. As agents take on longer, more autonomous work, an inefficient harness adds up fast.
Making the GitHub Copilot agentic harness more token-efficient is continuous work, and it's the best way to counter this trend. For most changes, we run A/B experiments in production and offline evaluations against task suites to confirm that task success rate holds or improves while token usage drops. It's rarely one big win; more often it's a steady stream of small ones. Below, we walk through recent gains, first for OpenAI models and then for Anthropic models.
How agentic requests spend tokens
Two costs sit at the heart of every agentic request, and two ideas help us reduce them. Both apply across OpenAI and Anthropic models, even though each provider exposes them differently.
Graphical overview of the prompt signature highlighting the different parts of the prompt.
The prompt prefix and caching. In an agentic coding session, a large share of every request repeats across turns: system instructions, tool definitions, repository context, and conversation history. This repeated beginning is the prompt prefix. When requests share the exact same prefix, the inference provider can reuse cached model state instead of recomputing it from scratch on each request. Despite the name, the cached artifact is not a human-readable copy of the prompt. It is the model state computed while processing that prefix, represented internally as key/value tensors. Reusing the prefix cuts both cost (cached tokens can be up to 10 times cheaper) and latency, which is why we work to keep the prompt cache hit rate high.
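To make the prefix idea concrete, here is a minimal sketch, not Copilot's or any provider's actual implementation. It assumes a prompt can be treated as a flat list of tokens and that reuse is limited to the longest identical leading run of two requests. The function name `shared_prefix_len` and the sample tokens are hypothetical.

```python
from itertools import takewhile


def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count the leading tokens that are identical in both requests."""
    pairs = zip(previous, current)
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], pairs))


# Hypothetical token sequences standing in for two consecutive turns.
system = ["<system>", "You", "are", "a", "coding", "agent"]
tools = ["<tools>", "read_file", "run_tests"]
turn_1 = system + tools + ["<user>", "fix", "the", "bug"]
turn_2 = turn_1 + ["<assistant>", "reading", "file", "<user>", "add", "a", "test"]

# Turn 2 only appends to turn 1, so all of turn 1 is a reusable prefix.
stable = shared_prefix_len(turn_1, turn_2)

# Changing an early token (for example, a timestamp injected into the
# system prompt) ends the shared prefix at the point of the change.
edited = ["<system>", "It", "is", "10:42"] + turn_1[1:]
broken = shared_prefix_len(turn_1, edited)

print(f"append-only turn reuses {stable} of {len(turn_1)} tokens")
print(f"early edit reuses {broken} of {len(turn_1)} tokens")
```

The contrast between the two cases is the design point: appending to the end of the conversation preserves the prefix, while any change near the start invalidates everything after it.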
Tool-definition overhead. Agents can pull in a large number of tools: those exposed by MCP servers, built-in tools, or extension-provided tools. Each tool is sent to the model with a full definition (a name, a description, and a complete JSON parameter schema), and historically every one was loaded into context on every request. Even when that data is cached, it still occupies the context window on every turn, and the overhead grows with the toolset.
Tool search. Tool search reduces that overhead by letting the model load tool definitions on demand instead of all at once. Upfront, the model sees only lightweight metadata, the name and description of each deferred tool, and the heavier parameter schemas stay out of context until the model searches for a tool and loads it. Because deferred tools are added at the end of the context window rather than the prefix, the cached prompt prefix stays reusable and the caching gains keep working across turns. The payoff is a leaner context window: the model spends fewer tokens on tools it never uses, leaving more room and budget for the actual task.
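The deferred-loading flow described above can be sketched as follows. This is an illustration under stated assumptions, not the provider's tool search API: the registry `TOOLS`, the helper names, and the naive keyword match are all hypothetical, and the context is modeled as a simple list of entries.

```python
import json

# Hypothetical registry of full tool definitions, including parameter schemas.
TOOLS = {
    "read_file": {
        "description": "Read a file from the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    "run_tests": {
        "description": "Run the project's unit suite.",
        "parameters": {
            "type": "object",
            "properties": {"filter": {"type": "string"}},
            "required": [],
        },
    },
    "create_branch": {
        "description": "Create a git branch.",
        "parameters": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
    },
}


def upfront_metadata(tools: dict) -> list[dict]:
    """What the model sees initially: names and descriptions only."""
    return [{"name": n, "description": s["description"]} for n, s in tools.items()]


def search_tools(tools: dict, query: str) -> list[str]:
    """Naive keyword match standing in for a real tool search."""
    q = query.lower()
    return [n for n, s in tools.items()
            if q in n.lower() or q in s["description"].lower()]


def load_tool(context: list[dict], tools: dict, name: str) -> None:
    """Append the full definition at the end, leaving earlier entries untouched."""
    context.append({"name": name, **tools[name]})


context = upfront_metadata(TOOLS)
prefix_len = len(context)
prefix_before = json.dumps(context)

for name in search_tools(TOOLS, "test"):
    load_tool(context, TOOLS, name)

# The original entries are byte-for-byte unchanged; only the tail grew.
prefix_unchanged = json.dumps(context[:prefix_len]) == prefix_before
```

Because `load_tool` only appends, the entries that form the cached prefix never change, which is the property that keeps caching effective across turns.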
Efficiency wins for OpenAI models
For OpenAI models, our recent work focused on reducing usage costs and latency for Copilot users through improved token efficiency. We pursued that through three changes: retaining cached model state for longer, reducing tool-definition overhead, and replacing repeated HTTP requests with persistent WebSocket connections.
Extended prompt caching
OpenAI models cache the prompt prefix automatically: the provider infers the reusable prefix and reuses its model state across requests. That reuse has a direct cost benefit. For most OpenAI models that support cached input pricing, uncached input tokens cost 10 times as much as cached input tokens.
Caching the prefix happens on its own, but how long that cache survives is something we can configure. After careful evaluation, we enabled extended prompt caching for supported models through the prompt_cache_retention body parameter. By default, the cache lives in fast GPU memory, where it is dropped after about 5 to 10 minutes of inactivity (up to an hour in some cases) to make room for other work. Setting "prompt_cache_retention": "24h" moves the cache to slower but roomier GPU-local storage and keeps it for up to 24 hours.
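As a rough illustration of where the parameter sits, the sketch below builds a request body with and without extended retention. Only `prompt_cache_retention` and its `"24h"` value come from the text above; the `model` and `input` fields, the placeholder model name, and the helper `build_request_body` are assumptions made for this example, and no request is actually sent.

```python
import json


def build_request_body(model: str, user_input: str, extended_cache: bool) -> dict:
    """Assemble a JSON-serializable request body (hypothetical helper)."""
    body = {"model": model, "input": user_input}
    if extended_cache:
        # Opt in to extended retention; omitting the field keeps the default.
        body["prompt_cache_retention"] = "24h"
    return body


# "example-model" is a placeholder, not a real model identifier.
body = build_request_body("example-model", "Continue the refactor.", extended_cache=True)
payload = json.dumps(body, indent=2)
print(payload)
```

Keeping the field optional mirrors the point in the text: retention is only enabled for models that support it, and other requests fall back to the default cache behavior.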
The benefit is simple. With the default cache, a pause of more than a few minutes throws the cache away, so your next request has to reprocess the whole prefix at the full, uncached price. Extended retention keeps the cache warm, so picking up where you left off is still fast and cheap, even after a long break.
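A small worked calculation shows why this matters when resuming after a break. The prices and token counts below are hypothetical; the only figure taken from the text is the 10-to-1 ratio between uncached and cached input tokens, and output tokens are ignored for simplicity.

```python
# Hypothetical price in arbitrary units per million uncached input tokens.
UNCACHED_PRICE = 1.0
# Cached input tokens are assumed to cost one tenth of uncached ones.
CACHED_PRICE = UNCACHED_PRICE / 10


def input_cost(prefix_tokens: int, new_tokens: int, cache_hit: bool) -> float:
    """Input cost of one request; new tokens are always billed uncached."""
    prefix_price = CACHED_PRICE if cache_hit else UNCACHED_PRICE
    return (prefix_tokens * prefix_price + new_tokens * UNCACHED_PRICE) / 1_000_000


# Hypothetical session: a 100,000-token prefix plus 2,000 new tokens.
cold = input_cost(100_000, 2_000, cache_hit=False)  # cache expired during the break
warm = input_cost(100_000, 2_000, cache_hit=True)   # cache retained

print(f"cold: {cold:.3f}, warm: {warm:.3f}, ratio: {cold / warm:.1f}x")
```

Under these assumed numbers, the first request after a break costs about 8.5 times more on input when the cache has been dropped than when it has been kept.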
After enabling extended prompt caching for supported OpenAI models in VS Code, we measured the following relative increases in cache hit rate. These are...