Every AI Agent Feature Is a Cache Invalidation Surface | OpenClacky
I'm Yafei Lee, founder of OpenClacky, an open-source AI agent written in Ruby. We wanted an agent with skills, memory, sub-agents, browser automation, dynamic model switching, and long-running sessions. Each of those features made prompt caching worse in a different way.
That was the real architecture problem: not how to call an LLM, add another tool, or orchestrate more agents, but how to keep the cache prefix stable while the product keeps changing.
Every agent feature is also a cache invalidation surface. Skills load new system context. Peer-agent workflows fork the prefix. Browser automation adds volatile tool output. Compression rewrites history. Model switching can fragment the cache namespace unless model-specific state stays out of the system prompt. If you're building a capable agent and your cache hit rate is much lower than expected, this is probably why.
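To make "prefix stability" concrete, here is a minimal sketch in Python (chosen for brevity; OpenClacky itself is written in Ruby). It assumes a simplified model in which a request is a list of text segments and only the leading segments that match the previous request byte for byte can be served from cache. Real providers match on token prefixes with their own granularity and expiry rules, and the function name and sample prompts are hypothetical.

```python
from typing import List


def cached_prefix_len(previous: List[str], current: List[str]) -> int:
    """Count the leading segments that are identical in both requests."""
    hits = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # the first difference ends the reusable prefix
        hits += 1
    return hits


system = "You are a coding agent."
history = ["user: fix the failing test", "assistant: reading test_api.py"]
turn_1 = [system] + history

# Appending to the end keeps the whole earlier request reusable.
turn_2 = turn_1 + ["tool: 1 test failed"]

# Loading a skill by editing the system prompt changes segment 0,
# so nothing after it can be reused either.
turn_2_with_skill = [system + " Skill: browser."] + history + ["tool: 1 test failed"]

print(cached_prefix_len(turn_1, turn_2))             # 3
print(cached_prefix_len(turn_1, turn_2_with_skill))  # 0
```

The second result is the point: a change near the front of the request invalidates everything behind it, no matter how much of the later content is unchanged.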
Over two years and three architecture generations (the first two failed), we converged on seven engineering decisions that let us reach cache hit rates above 90% on real tasks while keeping all of those features intact. What follows is the complete story: what broke, what we tried, and what actually worked.
Generation 1: RAG Everything (2024 – early 2025)
Our first agent was a textbook RAG system. We embedded the user's codebase, docs, and conversation history into a vector store. Every query went through hybrid retrieval, re-ranking, and query rewriting before the LLM saw anything.
It sounded right. It wasn't.
The costs never stopped climbing, and the data was always stale. Every codebase update required re-embedding. Real-time sync was unreliable, so the vector store lagged behind the actual code. We were paying more and more to search an index that was increasingly wrong.
And 90% recall is not good enough. One in ten retrievals returned the wrong context. For an agent that chains multiple steps, that error compounds fast. A wrong file in step 2 means a wrong edit in step 3 means a wasted retry in step 4. We estimated that 97% recall might be the bare minimum for an agent to be net-positive, and we were nowhere close. On top of that, the vector database was one more component that could crash, lag, or return garbage. Every extra piece between the user and the LLM is a place where latency hides and errors compound.
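The compounding argument can be checked with a few lines of arithmetic. This sketch assumes, as a simplification, that each step in a chain depends on exactly one retrieval and that retrievals succeed independently. Real agent runs are messier, so treat the numbers as an illustration of the trend rather than a measurement.

```python
def chain_success(recall: float, steps: int) -> float:
    """Probability that every step gets the right context,
    assuming one independent retrieval per step."""
    return recall ** steps


for recall in (0.90, 0.97):
    for steps in (4, 10):
        rate = chain_success(recall, steps)
        print(f"recall={recall:.2f} steps={steps:2d} -> {rate:.4f}")
```

Under these assumptions a four-step chain at 90% recall gets clean context about 66% of the time and a ten-step chain about 35%. At 97% recall the same chains come out near 89% and 74%.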
For coding agents working over local repos, we killed RAG entirely. No embeddings, no vector store, no retrieval pipeline. If the agent needs context, it reads files directly or searches with grep. If your documentation needs to be accessible to an agent, make it readable on a website. Don't shred it into embeddings.
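As a rough idea of how small the replacement can be, here is a sketch of a grep-style search tool built on the Python standard library. The function name, signature, and default file pattern are hypothetical and are not OpenClacky's actual tool interface.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple


def grep_repo(root: str, pattern: str, glob: str = "*.py") -> Iterator[Tuple[str, int, str]]:
    """Yield (relative path, line number, line) for each regex match under root."""
    regex = re.compile(pattern)
    base = Path(root)
    for path in sorted(base.rglob(glob)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield str(path.relative_to(base)), lineno, line.strip()
```

Calling `list(grep_repo(".", r"def \w+_handler"))` returns the matching lines as they exist on disk at that moment, so there is no index to rebuild and nothing to go stale.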
Generation 2: Multi-Agent Orchestration (mid-2025)
The next idea was straight from the SWE-bench leaderboard playbook: a Planner agent, a Coder agent, a Reviewer agent, and a Tester agent, all coordinated through a message bus with role-specific prompts.
We got decent SWE-bench scores. The product was terrible.
Every agent handoff was a cache miss. Each sub-agent had its own system prompt and cache namespace. Passing context between agents meant serializing state into messages, and every handoff wiped the receiving agent's cache prefix. The problem was not just cache misses. Each handoff forced us to serialize rich state into a smaller message, and useful context was lost at the boundary.
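A toy simulation shows why handoffs hurt. It assumes one cache namespace per agent role, counts a hit only for leading segments that match the previous request in that namespace, and gives each role a single turn. The class, function names, and prompts are hypothetical; in a real pipeline a role that runs repeatedly would at least re-hit its own system prompt.

```python
from typing import Dict, List


class PrefixCache:
    """Toy cache: remembers the last request per namespace."""

    def __init__(self) -> None:
        self._last: Dict[str, List[str]] = {}

    def send(self, namespace: str, segments: List[str]) -> int:
        """Record a request and return how many leading segments were hits."""
        previous = self._last.get(namespace, [])
        hits = 0
        for old, new in zip(previous, segments):
            if old != new:
                break
            hits += 1
        self._last[namespace] = list(segments)
        return hits


def run_single_agent(steps: List[str]) -> int:
    """One agent, one namespace, history only ever appended."""
    cache = PrefixCache()
    request = ["system: general agent"]
    total = 0
    for step in steps:
        request = request + [step]
        total += cache.send("agent", request)
    return total


def run_with_handoffs(steps: List[str], roles: List[str]) -> int:
    """Each role starts from its own system prompt plus a serialized summary."""
    cache = PrefixCache()
    total = 0
    summary = "task summary"
    for step, role in zip(steps, roles):
        total += cache.send(role, [f"system: {role}", summary, step])
        summary = f"summary after {step}"
    return total


steps = ["plan", "code", "review", "test"]
print(run_single_agent(steps))                                               # 9
print(run_with_handoffs(steps, ["planner", "coder", "reviewer", "tester"]))  # 0
```

The single agent reuses a longer prefix on every turn (0 + 2 + 3 + 4 segments), while each handoff starts from an empty cache.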
A task that one agent could finish in 4 minutes took 14 minutes with four. The coordination overhead was real: agents waited for each other, re-read context the previous agent had already processed, and occasionally contradicted each other's decisions.
Cost was 6× higher. Four separate cache namespaces, four system prompts, constant serialization. The "divide work among specialists" intuition that works for human teams doesn't transfer to LLMs. A single frontier model is already a generalist. You're not dividing labor; you're multiplying overhead.
Debugging was a nightmare. When the final output was wrong, which agent caused it? The Planner gave ambiguous instructions? The Coder misinterpreted them? The Reviewer missed the bug? We spent more time tracing failures through the pipeline than we spent on the original task. At least with a single agent, when something goes wrong, you read one conversation and find the...