Research
Subconscious Cache: Reliably Capture Your Agent Context<br>June 3, 2026
Hongyin Luo<br>Co-Founder & CTO
Wei Fang<br>Founding Scientist
Every time an agent compacts its context, today's inference systems throw away work they already computed and re-encode it from scratch, exactly when the trace is longest and the user is least patient. Subconscious Cache is the solution. Built into our runtimes, it makes our models faster and more accurate on reasoning tasks, all in an OpenAI Completions and Anthropic Messages API compatible format.
TL;DR<br>We show where general-purpose language model inference systems fall short for agent inference, especially when context engineering happens.<br>We present Subconscious Cache as a solution, which enables lower latency and larger batch sizes in inference, and makes context engineering less lossy for better performance.<br>Enabled by Subconscious Cache, our TIM family of models no longer requires server-side tool calls or a recursive JSON reasoning format. TIM models now automatically and progressively compact reasoning history during inference using the familiar chat completions and messages formats.<br>We show experiments with small language models on agentic tasks with improved accuracy and efficiency. We will keep updating these results with frontier open-source models in the coming weeks.<br>Background<br>The context problem for AI agents is far from solved, and it bottlenecks both capability and efficiency.<br>On the capability side, modern agents rely on multi-hop tool calls, long reasoning traces, and self-healing from mistakes. Each of these inflates the message list. Even frontier LLMs with 1M-token windows begin to degrade as the window fills up, a phenomenon often called context rot. The standard remedy, context compaction, is itself lossy: it strips reasoning memory, intermediate instructions, and long-term constraints that the agent still needs. Open-source models feel this pain earlier and more frequently as an agent trace accumulates. State-of-the-art open alternatives such as Qwen and Nemotron cap out around 256k tokens, so agents built on them must compact more aggressively, and pay a steeper accuracy cost when they do.<br>On the efficiency side, the prefix caching mechanisms that modern inference systems rely on are helpful for building agents [Manus AI 2025, Cognition 2025], but they are still not efficient enough.
Frequent context engineering constantly invalidates the cached prefix, forcing the system to re-encode large spans of tokens that were, in effect, already computed moments ago. Throughput collapses and latency spikes exactly when the agent trace is longest and the user is least patient.<br>In this article we introduce our approach to both problems: Subconscious Cache and Auto Compaction. Subconscious Cache extends prefix matching to also reuse cached suffixes, so that pruning a span of intermediate tokens no longer throws away the latent information they contributed and no longer forces a re-encode of everything that followed. Auto Compaction, building on our earlier work, lets the model itself decide what to prune at inference time, and relies on Subconscious Cache to preserve memory across those prunes. Together, the two mechanisms deliver better agent reasoning and substantially better inference efficiency.<br>Agents are bottlenecked by prefix cache<br>To partially prevent redundant computation, modern LLM inference systems maintain a prefix cache. When a new request comes in, the system tries to match its tokens via prefix lookup in the cache and reuse their KV pages and recurrent states (for hybrid models). This approach significantly reduces redundant input encoding (prefilling), and thus improves throughput for model servers and reduces latency for users and developers. Because re-computation is avoided, frontier model providers charge only a fraction of the regular price for cached tokens. To take advantage of this, frontier agent builders have been focusing on improving token hit rate for better efficiency (Manus article).<br>However, the prefix cache alone cannot effectively capture the entire context of an agentic task. As shown below, long agent reasoning traces undergo a pruning process. When a new user message or tool response is appended to the message list, some old content is removed.
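To make the cost of pruning concrete, here is a minimal sketch of token-level prefix matching. It is a toy model under stated assumptions, not how any particular inference server is implemented: the function names (`common_prefix_len`, `prefill_cost`), the section names, and the section sizes are all hypothetical, and real servers track KV pages with more elaborate data structures.

```python
def common_prefix_len(cached, request):
    """Count the leading tokens shared by the cached sequence and the request."""
    n = 0
    for cached_tok, request_tok in zip(cached, request):
        if cached_tok != request_tok:
            break
        n += 1
    return n


def prefill_cost(cached, request):
    """Return (tokens reused from the prefix cache, tokens that must be re-encoded)."""
    hit = common_prefix_len(cached, request)
    return hit, len(request) - hit


# Hypothetical token sections; the sizes are illustrative only.
head = [f"h{i}" for i in range(100)]         # kept at the start of the context
pruned_span = [f"p{i}" for i in range(50)]   # removed by context engineering
tail = [f"t{i}" for i in range(200)]         # kept, but located after the pruned span
output = [f"o{i}" for i in range(30)]        # model output from the previous turn
new_input = [f"n{i}" for i in range(20)]     # new user message or tool response

# State of the prefix cache after the previous turn.
cached = head + pruned_span + tail + output

# Case 1: append-only. The whole cached sequence is a prefix of the request.
append_only = cached + new_input
print(prefill_cost(cached, append_only))   # (380, 20)

# Case 2: the span is pruned. Matching stops at the first removed token.
pruned = head + tail + output + new_input
print(prefill_cost(cached, pruned))        # (100, 250)
```

In this toy setup, removing 50 tokens turns a 20-token prefill into a 250-token prefill, because everything after the pruned span falls outside the matched prefix even though it was already encoded in the previous turn.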
This pattern happens in many use cases.<br>Consider an agent loop. At turn k, the input consists of three sections of tokens: A, B, C, and the LLM generates Output_k. In the next step, our agent harness appends new input D, but prunes the tokens in section B during context engineering. This forms a new input sequence for the LLM: A, C, Output_k, D. Such behavior is very common in modern AI applications, including chatbots and agents.<br>Chat: for long conversations, only keep the last few rounds. The removed section B can be a few...