79% on LongMemEval: How We Beat Full-Context GPT-4 with a Local SQLite Database

by Vektor Memory | Jun, 2026 | Medium

A benchmark result that changes what we thought was possible for local persistent agent vector memory

Vektor Memory

9 min read

We ran VEKTOR Slipstream against LongMemEval this week and scored 79.0%. That is 12 points above full-context GPT-4, 17 above Mem0, 24 above ReadAgent, and 30 above MemGPT. To understand why that number matters, you need to understand what LongMemEval actually tests, why it is hard, and what it took to get there.

What LongMemEval Is and Why It Is the Hardest Memory Benchmark

Memory benchmarks differ in what their questions test. Most check whether your system can retrieve a fact that was stored recently, in a clean format, with an obvious query. That is approximately what happens in a controlled demo. It is not what happens in production.

LongMemEval is different. It was designed specifically to stress-test the failure modes of real memory systems over real conversations. The benchmark contains 500 questions drawn from genuine multi-session chat histories, with an average of 344 memory items per question. The questions are distributed across seven categories, each targeting a specific failure mode. Among them:

Single-session retrieval tests whether you can answer a question from a single conversation correctly. Sounds easy. The catch is that the answer is buried in a long session, surrounded by noise, and the query phrasing bears no resemblance to how the answer was stored.

Multi-session reasoning asks you to connect facts across conversations that happened at different times. "What did the user say about their job last month?" requires knowing that those memories exist and linking them.

Temporal reasoning tests date-anchored facts. "Where was the user living when they started their new job?" requires understanding which memories belong to which time window.

Knowledge updates test whether your system correctly invalidates old facts. If a user says, "I moved to San Francisco" after previously saying, "I live in Los Angeles," the correct answer to "Where does the user live?" is San Francisco. Systems that append rather than supersede fail this category consistently.

Abstention tests whether your system knows when it does not know. Many systems hallucinate an answer rather than say "I don't have that information." Abstention at 90% means VEKTOR declined to answer when it lacked the information, nine times out of ten.

The baseline in this benchmark is brutal. Full-context GPT-4, where the entire conversation history is stuffed into the context window, scores 67%. That is the setup where the model literally sees everything and needs no intelligent storage at all. VEKTOR, running on local SQLite, beat it by 12 points.
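The difference between appending and superseding in the knowledge-updates category can be sketched in a few lines of SQLite. This is a minimal illustration, not VEKTOR's actual schema: the `facts` table, its columns, the helper names, and the example dates are all hypothetical, and the sketch assumes facts arrive in chronological order. A lookup that finds no active fact returns `None`, which is also the natural hook for abstaining.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    # In-memory database for illustration; the schema is hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE facts (
            id            INTEGER PRIMARY KEY,
            subject       TEXT NOT NULL,
            attribute     TEXT NOT NULL,
            value         TEXT NOT NULL,
            observed_at   TEXT NOT NULL,               -- ISO 8601 date
            superseded_by INTEGER REFERENCES facts(id) -- NULL while active
        )
        """
    )
    return conn


def upsert_fact(conn, subject, attribute, value, observed_at) -> int:
    """Insert a fact and retire any active fact with the same key.

    Assumes facts are ingested in chronological order.
    """
    with conn:  # one transaction: insert the new row, then retire the old one
        cur = conn.execute(
            "INSERT INTO facts (subject, attribute, value, observed_at) "
            "VALUES (?, ?, ?, ?)",
            (subject, attribute, value, observed_at),
        )
        new_id = cur.lastrowid
        conn.execute(
            "UPDATE facts SET superseded_by = ? "
            "WHERE subject = ? AND attribute = ? "
            "AND superseded_by IS NULL AND id != ?",
            (new_id, subject, attribute, new_id),
        )
    return new_id


def current_value(conn, subject, attribute):
    """Return the active value, or None when nothing is known (abstain)."""
    row = conn.execute(
        "SELECT value FROM facts "
        "WHERE subject = ? AND attribute = ? AND superseded_by IS NULL",
        (subject, attribute),
    ).fetchone()
    return row[0] if row else None


store = make_store()
upsert_fact(store, "user", "city", "Los Angeles", "2026-01-10")    # made-up date
upsert_fact(store, "user", "city", "San Francisco", "2026-03-02")  # made-up date
print(current_value(store, "user", "city"))      # San Francisco
print(current_value(store, "user", "employer"))  # None
```

The old row is kept rather than deleted, so a temporal question such as where the user lived at an earlier date could still be answered from the same table.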

The Four Versions We Ran to Get Here

We did not start at 79%. We started at 48.6% and ran four iterations to understand what was failing and why.

v1 (48.6%) was a naive implementation: store every turn as raw memory and retrieve it by vector similarity. The immediate failure was obvious. Questions like "What did the user say about their sister's wedding?" returned semantically similar memories about events, parties, and celebrations. Technically correct retrieval. Wrong answer.

v2 (57.1%) added BM25 keyword search fused with semantic search via Reciprocal Rank Fusion. This improved single-session recall significantly. Multi-session questions still failed because the system had no way to reason about when memories occurred relative to each other.

v3 (55.2%) was a step backward. We introduced aggressive deduplication and contradiction detection, which accidentally removed valid memories that looked similar but referred to different time periods. Lesson: deduplication needs temporal awareness, not just semantic similarity.

v4 (79.0%) introduced what we are calling routed ingest, and it is the architectural decision that drove the result.
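The fusion step added in v2 can be illustrated with a short sketch of Reciprocal Rank Fusion. Each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in. The function name, the memory IDs, and the constant k = 60 are assumptions made for illustration (60 is a commonly used default for RRF; the value VEKTOR uses is not stated here).

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    rankings: iterable of lists, each ordered from most to least relevant.
    k: smoothing constant; 60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword search and a vector search.
bm25_hits = ["m7", "m2", "m9"]
vector_hits = ["m2", "m4", "m7"]

print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['m2', 'm7', 'm4', 'm9']
```

Because only ranks are used, the BM25 scores and the cosine similarities never need to be normalized onto a common scale. A memory that ranks well in both lists ("m2", "m7") rises above one that ranks well in only one.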

Routed Ingest: The Strategy That Changed Everything

The core insight behind routed ingest is simple: different types of memories benefit from fundamentally different storage strategies. Before this, every conversation turn was stored the same way. Raw text, embedded, inserted. The problem is that “I moved to San Francisco last Tuesday” and “I prefer dark mode” and “the payment API went live yesterday” are three completely different types of information. Treating them identically is why most memory systems plateau in the 55 to 65% range.

Routed ingest assigns each memory to one of two pipelines at write time: Extraction pipeline for complex, cross-session, time-sensitive information....
