We Benchmarked Our Open Source Memory Tool Against a Microsoft Research Paper

by Vektor Memory | May, 2026 | Medium



by VEKTOR Memory, 10 min read

I found this whitepaper while digging through arXiv today; there are so many great papers and so little time in the day to read them all.

A researcher at Microsoft published a paper in May 2026 measuring how well AI agents can continue tasks after their memory has been transferred to a different model. The Transfer Continuity Score (TCS) they reported was 0.88, tested on GPT-4 Turbo across 50 engineering scenarios. We ran the same benchmark against VEKTOR Slipstream and scored 0.894.

This article explains the methodology, the honest caveats, what we updated and built into Vex as a result, and why the lift ratio matters more than the headline number.

Why agent memory migration is disparate

Every agent framework ships with some version of memory. Most of them store conversation history in a growing buffer that eventually hits token limits, has its oldest entries truncated, and produces an agent that can remember what happened five minutes ago but not five days ago. The more serious implementations use vector stores, which at least scale, but they introduce a different problem: the memories are trapped in whatever format the vector store uses. Move to a different framework and you either write a migration script or start over.

VEKTOR Slipstream is our answer to the storage problem: SQLite-backed persistent memory with BM25+RRF recall and no cloud dependency. After a year of daily use I have 5725 memories in mine. Vex is our answer to the portability problem: a CLI tool that exports those memories to .vmig.jsonl, an open interchange format with connectors for Pinecone, Qdrant, Chroma, Weaviate, pgvector, and VEKTOR itself.

What neither of these addressed until this week was integrity verification. You could export 5725 memories. You could not prove they were unchanged when someone imported them on the other side.
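For readers who have not met RRF before, here is a minimal sketch of Reciprocal Rank Fusion, the technique named in "BM25+RRF recall". It is not VEKTOR Slipstream's code: the function name, the memory ids, the second ranking, and the constant k = 60 (a commonly used default) are all assumptions made for illustration.

```javascript
// Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank) to an item's
// score, with rank starting at 1. Items ranked well by several lists rise.
// Hypothetical helper, not part of VEKTOR Slipstream.
function reciprocalRankFusion(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

// Two made-up rankings of memory ids, e.g. one from BM25 and one from
// some other retriever. The article does not say which lists are fused.
const bm25Ranking = ["m3", "m1", "m7"];
const otherRanking = ["m1", "m9", "m3"];

const fused = reciprocalRankFusion([bm25Ranking, otherRanking]);
console.log(fused.map((r) => r.id)); // [ 'm1', 'm3', 'm9', 'm7' ]
```

The appeal of RRF is that it needs only rank positions, so scores from different retrievers never have to be put on a common scale.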

The Paper That Prompted All Of This

“Portable Agent Memory: A Protocol for Provenance-Verified Memory Transfer Across Heterogeneous LLM Agents” by Santhosh Kumar Ravindran at Microsoft is a good piece of systems research. It proposes a five-component memory model (Episodic, Semantic, Procedural, Working, Identity), a BLAKE3 Merkle-DAG for content-addressed integrity, Ed25519 signing, and a seven-step re-hydration pipeline that defends against prompt injection through recalled memory.

The benchmark result they report is a TCS of 0.88 across model pairs (Claude to GPT-4, GPT-4 to Gemini, Gemini to Claude), compared to a no-memory baseline of 0.35. That is a 2.51x lift. The methodology is documented well enough to replicate, so we did.
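To make the content-addressed integrity idea concrete, here is a rough sketch of how an exporter and an importer could agree on a single root hash over a set of memories. It is a simplification under stated assumptions, not the paper's protocol or Vex's implementation: it builds a plain binary Merkle tree rather than a Merkle-DAG, it swaps in SHA-256 from Node's built-in crypto module because BLAKE3 is not part of Node core, it omits Ed25519 signing, and the record fields (id, text) are invented for the example.

```javascript
const { createHash } = require("node:crypto");

// SHA-256 stands in for BLAKE3 here (an assumption, see the note above).
function sha256Hex(input) {
  return createHash("sha256").update(input).digest("hex");
}

// Serialise a flat record with sorted keys so key order cannot change the hash.
// Only handles flat objects; nested records would need a deeper canonical form.
function canonicalJson(record) {
  return JSON.stringify(record, Object.keys(record).sort());
}

// Hash every record, then hash pairs upward until one root remains.
// When a level has an odd count, the last hash is paired with itself.
function merkleRoot(records) {
  if (records.length === 0) return sha256Hex("");
  let level = records.map((r) => sha256Hex(canonicalJson(r)));
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      const right = level[i + 1] ?? level[i];
      next.push(sha256Hex(level[i] + right));
    }
    level = next;
  }
  return level[0];
}

// Hypothetical memories; the field names are not the .vmig.jsonl schema.
const exported = [
  { id: "m1", text: "Deploys happen on Tuesdays" },
  { id: "m2", text: "Staging database is read-only" },
  { id: "m3", text: "Budget approved in Q3" },
];
const tampered = exported.map((r) =>
  r.id === "m2" ? { ...r, text: "Staging database is writable" } : r
);

const rootAtExport = merkleRoot(exported);
console.log(merkleRoot(exported) === rootAtExport); // true: unchanged data
console.log(merkleRoot(tampered) === rootAtExport); // false: one record edited
```

The exporter would publish the root alongside the file, and the importer would recompute it after reading. Signing the root is what ties it to a specific author, which this sketch leaves out.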

Building The Benchmark Harness

Transfer Continuity Score measures task success on the target agent once the transferred memory is available. A score of 1.0 means the target agent performs identically to the source agent. Task success with memory divided by task success without memory is a separate quantity, the memory lift ratio.

The PAM paper uses 50 tasks across three categories: 20 Q&A recall tasks, 15 coding continuation tasks, and 15 planning tasks. We wrote a Node.js benchmark that mirrors this structure exactly. Each task has a set of memories representing what the source agent “learned,” a natural language question, and a set of expected keywords that judge the answer.

The judge is a normaliser rather than an LLM call, which keeps costs down and scoring deterministic. Writing a good normaliser turns out to be non-trivial: “eight million dollars” needs to match “$8m”, “September 1, 2026” needs to match “sep 2026”, “Net Promoter Score” needs to match “nps”, “48-hour window” needs to match “48 hours”. It took four iterations to get it right.

One upfront disclaimer: the PAM paper used GPT-4 Turbo for all evaluations. We used gpt-4o-mini for Q&A and coding tasks, and gpt-4o for planning tasks where the weaker model was paraphrasing too many exact figures. The comparison is directional, not perfectly controlled. We note this in the results.
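A toy version of such a keyword judge might look like the sketch below. It is not the harness described above: the lookup tables, the regular expressions, and the names normalise and judge are all assumptions, sized just large enough to cover the four examples in the text. A real normaliser would need far more rules and would have to guard against false matches (for instance, "5 m" meaning five metres).

```javascript
// Hypothetical lookup tables, deliberately tiny.
const NUMBER_WORDS = { one: "1", two: "2", three: "3", four: "4", five: "5",
  six: "6", seven: "7", eight: "8", nine: "9", ten: "10" };
const ACRONYMS = { "net promoter score": "nps" };
const MONTHS = ["january", "february", "march", "april", "may", "june", "july",
  "august", "september", "october", "november", "december"];

const NUMBER_RE = new RegExp(`\\b(${Object.keys(NUMBER_WORDS).join("|")})\\b`, "g");
// Accept either the full month name or its three-letter abbreviation.
const MONTH_ALT = MONTHS
  .map((m) => (m.length > 3 ? `${m.slice(0, 3)}(?:${m.slice(3)})?` : m))
  .join("|");
const DATE_RE = new RegExp(
  `\\b(${MONTH_ALT})\\.?(?:\\s+\\d{1,2}(?:st|nd|rd|th)?,?)?\\s+(\\d{4})\\b`, "g");

function normalise(text) {
  let s = text.toLowerCase().replace(/-/g, " ");
  for (const [phrase, short] of Object.entries(ACRONYMS)) {
    s = s.split(phrase).join(short);
  }
  s = s.replace(NUMBER_RE, (word) => NUMBER_WORDS[word]);
  // "8 million dollars", "$8 million", "$8m" all become "$8m".
  s = s.replace(/\$?(\d+(?:\.\d+)?)\s*(?:million|m)\b(?:\s+dollars)?/g, "$$$1m");
  // "48 hours" and "48 hour" both become "48 hour".
  s = s.replace(/\b(\d+)\s*(hour|day|week|month|year)s?\b/g, "$1 $2");
  // "september 1, 2026" and "sep 2026" both become "sep 2026".
  s = s.replace(DATE_RE, (_, month, year) => `${month.slice(0, 3)} ${year}`);
  return s.replace(/\s+/g, " ").trim();
}

// Score = fraction of expected keywords found in the normalised answer.
function judge(answer, expectedKeywords) {
  const haystack = normalise(answer);
  const hits = expectedKeywords.filter((kw) => haystack.includes(normalise(kw)));
  return hits.length / expectedKeywords.length;
}

const sampleAnswer = "We closed eight million dollars on September 1, 2026; " +
  "Net Promoter Score is up and there is a 48-hour window.";
console.log(judge(sampleAnswer, ["$8m", "sep 2026", "nps", "48 hours"])); // 1
console.log(judge("The NPS rose", ["nps", "$8m"])); // 0.5
```

Normalising both the answer and the expected keywords with the same function is the design choice that keeps scoring deterministic: the judge never has to decide which surface form is the "right" one.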

Results

VEKTOR TCS Benchmark, June 2026
N=50 tasks, methodology: arXiv:2605.11032

Category              VEKTOR                  PAM (GPT-4 Turbo)
Q&A Recall            0.916 (gpt-4o-mini)     0.920
Coding                0.918 (gpt-4o-mini)     0.870
Planning              0.840 (gpt-4o)          0.850
Overall TCS           0.894                   0.880
No-memory baseline    0.149-0.253             0.350
Memory lift ratio     6.61x                   2.51x

The headline is 0.894 vs 0.880. VEKTOR wins, narrowly, with a model that is meaningfully weaker than GPT-4 Turbo. On coding specifically the result is 0.918 vs 0.870, which is the cleanest comparison in the data because both runs used gpt-4o-mini with no handicap adjustment.

The lift ratio deserves more attention than the raw TCS score. VEKTOR’s no-memory baseline is 0.149 to 0.253 depending on category. PAM’s baseline is 0.350. The tasks are harder for...
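As a sanity check on the table, the overall scores can be recomputed from the per-category figures. The sketch below assumes the overall TCS is a mean weighted by task count (20 Q&A, 15 coding, 15 planning); the article does not state this, but the assumption reproduces the reported 0.894 and is consistent with PAM's 0.88. The function names are hypothetical. VEKTOR's lift ratio is not recomputed because its baseline is given only as a range.

```javascript
// Task counts per category, as described in the harness section.
const TASK_COUNTS = { qa: 20, coding: 15, planning: 15 };

// Per-category TCS values copied from the results table.
const vektorScores = { qa: 0.916, coding: 0.918, planning: 0.840 };
const pamScores = { qa: 0.920, coding: 0.870, planning: 0.850 };

// Assumed aggregation: mean of category scores weighted by task count.
function weightedTcs(scores, counts) {
  let weightedSum = 0;
  let totalTasks = 0;
  for (const [category, count] of Object.entries(counts)) {
    weightedSum += scores[category] * count;
    totalTasks += count;
  }
  return weightedSum / totalTasks;
}

// Lift ratio: score with memory divided by the no-memory baseline.
function liftRatio(withMemory, noMemoryBaseline) {
  return withMemory / noMemoryBaseline;
}

console.log(weightedTcs(vektorScores, TASK_COUNTS).toFixed(3)); // 0.894
console.log(weightedTcs(pamScores, TASK_COUNTS).toFixed(3));    // 0.884
console.log(liftRatio(0.88, 0.35).toFixed(2));                  // 2.51
```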
