Open Knowledge Format Benchmark: What Happens When OKF Runs Inside an AI Tool | Tenure Platform
Google released the Open Knowledge Format as a simple markdown standard for AI knowledge.<br>We wanted to test the part that matters in practice: what happens when an OKF bundle is placed<br>inside an AI tool and the model has to decide which files to inspect.
Tenure research
TL;DR<br>This was not a test of markdown as a storage format.<br>It was a test of the runtime pattern implied by OKF when a bundle is dropped into an AI tool today.<br>The model received normal tool access to list files and read files. No custom retrieval-agent system prompt.<br>PrecisionMemBench scored the belief IDs corresponding to the files the model actually read.<br>OKF did much better than raw semantic memory retrieval, but it still showed the same core issue: files are a format, not a retrieval policy.
Why test it<br>I like OKF. I just wanted to see what happens when an AI tool actually has to use it.
I like the direction of the Open Knowledge Format. A directory of markdown files with YAML<br>frontmatter is boring in the best way. It is easy to read, easy to diff, easy to commit, and<br>easy for humans to maintain. That is a real advantage over proprietary memory blobs or hidden<br>vendor indexes.
But after reading the spec, the obvious question was not whether markdown is a good interchange<br>format. Markdown is fine. The question was what happens at runtime.
Because the moment an OKF bundle gets placed into an AI tool, somebody still has to decide which<br>files enter the model request. Maybe the model lists the files. Maybe it reads one. Maybe it reads<br>several. Maybe it never opens the file that actually contains the needed belief. The format makes<br>the knowledge portable. It does not, by itself, make retrieval precise.
The thing we tested was not Open Knowledge Format as a storage layer. We tested the access pattern<br>most teams would get if they dropped an OKF bundle into a tool today and let the model inspect it.
The setup<br>How we ran the OKF bundle
The wrapper was intentionally plain. The model got a normal assistant prompt and two tools:<br>one to list available OKF markdown files, and one to read a specific file. The model was not<br>given a custom retrieval-agent prompt telling it how to behave. It just received a user query<br>and decided whether to inspect the bundle.
When the model called read_file, the wrapper parsed the file's frontmatter, extracted<br>the beliefId, and returned that ID to PrecisionMemBench as a retrieved belief. That is the<br>whole bridge. Files read by the model became retrieved belief IDs. Files not read did not count.
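The bridge described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual wrapper: `read_file` and `beliefId` come from the description above, while `OkfBundleTools`, `list_files`, `parse_frontmatter`, and `retrieved_belief_ids` are hypothetical names. It also assumes flat `key: value` frontmatter; a real wrapper would likely use a proper YAML parser.

```python
from pathlib import Path


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between leading '---' fences.

    Assumption: frontmatter is flat. Nested YAML, lists, and multi-line
    values are not handled by this sketch.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip("\"'")
    return {}  # closing fence never found: treat as no frontmatter


class OkfBundleTools:
    """Hypothetical two-tool surface exposed to the model."""

    def __init__(self, root: Path) -> None:
        self.root = root
        # Belief IDs of files the model actually read, in first-read order.
        self.retrieved_belief_ids: list[str] = []

    def list_files(self) -> list[str]:
        """Tool 1: list the markdown files in the bundle."""
        return sorted(
            p.relative_to(self.root).as_posix() for p in self.root.rglob("*.md")
        )

    def read_file(self, name: str) -> str:
        """Tool 2: return a file's text and record its beliefId as retrieved."""
        text = (self.root / name).read_text(encoding="utf-8")
        belief_id = parse_frontmatter(text).get("beliefId")
        if belief_id and belief_id not in self.retrieved_belief_ids:
            self.retrieved_belief_ids.append(belief_id)
        return text
```

After a query finishes, `retrieved_belief_ids` is what would be handed to the benchmark. Files the model never opened contribute nothing, which is the property the run depends on.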
That matters because PrecisionMemBench does not score whether the final answer sounds good.<br>It scores whether the system retrieved the right underlying beliefs. In this run, the question was<br>simple: did model-directed file access pull the right OKF documents into the request?
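To make the scoring concrete, here is a sketch of set-based precision and recall over belief IDs. It uses the standard definitions and is an assumption about the general shape of the metric, not PrecisionMemBench's actual implementation; the function name and the handling of empty sets are illustrative choices.

```python
def precision_recall(retrieved: set[str], expected: set[str]) -> tuple[float, float]:
    """Standard set-based precision and recall over belief IDs.

    Assumed conventions for this sketch: retrieving nothing scores 0.0
    precision, and a case with no expected beliefs scores 1.0 recall.
    """
    hits = len(retrieved & expected)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expected) if expected else 1.0
    return precision, recall


# The model read four files, but only two held the beliefs the case needed.
print(precision_recall({"b-1", "b-2", "b-3", "b-4"}, {"b-1", "b-2"}))  # (0.5, 1.0)
```

Under these definitions, every extra file the model opens lowers precision even when the right file is among them, so a system can reach high recall and still score poorly on precision.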
Single-turn run<br>Mean precision: 0.47<br>77 cases, 36 passes, 18 active retrieval passes, 0.91 mean recall, and 4.4s mean latency.
Session run<br>Pass rate: 0.17<br>12 session turns, 2 passes, 1 active retrieval pass, 0.45 mean recall, and 59.3s p95 latency.
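The pass rates follow directly from the counts above. As a quick check, assuming a pass rate is simply passes divided by cases (the helper name is illustrative):

```python
def pass_rate(passes: int, cases: int) -> float:
    """Fraction of cases that passed."""
    return passes / cases


print(round(pass_rate(2, 12), 2))   # session run: 0.17
print(round(pass_rate(36, 77), 2))  # single-turn run: 0.47
```

The single-turn pass rate happens to round to the same 0.47 as the reported mean precision, but the two are different measurements: one counts passing cases, the other averages per-case precision.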
The encouraging part<br>OKF improved the shape of the problem.
The good news is that OKF did not behave like raw vector recall. The model had filenames, titles,<br>descriptions, types, tags, and readable file bodies. That gave it more handles than a cosine search<br>over opaque memory chunks.
Alias resolution was the clearest win. Across 23 alias cases, the run reached 0.72 mean precision<br>and 0.92 mean recall. Some short-form queries worked exactly as you would hope. A query for<br>GHA could lead the model to the GitHub Actions belief. A query for Mongo could lead<br>it to the MongoDB decision. In those cases, the filesystem pattern gave the model a real path<br>to the right document.
Ranking stability also looked strong. Those cases passed cleanly. That is worth saying because it<br>means the result is not a blanket criticism of OKF. When the query maps cleanly to the file surface,<br>markdown files with frontmatter are a perfectly reasonable representation.
The hard part<br>But file access is still not memory retrieval.
The failures showed up where memory systems usually fail: scope, supersession, type routing, and<br>session drift. These are not markdown problems. They are state problems.
Scope disambiguation had 12 cases and only 4 passed. Mean precision was 0.21. This is the classic<br>Redis problem. If Redis appears in a code context and a writing context, the model has to...