Context Graphs vs. Vector RAG vs. Raw Context

Context Graphs vs Vector RAG vs Raw Context - A benchmark for agent memory

By karan-kalra

July 01, 2026

18 min read


Retrieval is critical in AI agents. To do any task correctly, an agent needs to retrieve all the information relevant to that task from its memory. Context graphs are all the rage right now, so I benchmarked them against the alternatives. This post explains how each memory method works, what the benchmark asks, what the data is, and what each method got right and wrong.

The agent failure case

One of our clients came to us after their in-house agent kept dropping facts. It was a sales-support agent that needed to know "which office handles our Acme account?" to do a task, and it couldn't answer.

The agent had everything it needed in its memory: full conversation histories, vector databases on top, the lot. Someone had already noted that the Acme account is handled by Dana. Somewhere else, a PostgreSQL row noted that Dana works out of the Berlin office. Both facts were sitting right there, but the agent failed to retrieve them and put two and two together.

To get the answer, the agent's retrieval method had to join two facts that were never said in the same breath, and nothing in the agent's standard memory setup did that on its own. A context graph is built to fix these failures.

What similarity search can't do

Here are two facts an agent might ingest into its memory, days apart:

The Acme account is handled by Dana.
Dana works out of the Berlin office.

Now the question is: "Which office handles the Acme account?"

No single message answers it. You have to chain two facts that were never said together. This is called a multi-hop question, because the answer is two hops away: from the Acme account to Dana, and from Dana to the Berlin office.

The fact you need, "Dana works out of the Berlin office," never mentions Acme. So similarity search ranks it low against a question about the Acme account and skips it. A graph doesn't rank; it follows the edge.

How each memory method works

I tested four ways to give an agent memory.
Hold the Acme example in your head; I'll use it to walk through how each memory method works. When I get to the benchmark I'll switch to the data I actually run (software agents coordinating on work). If you already know these retrieval methods, skip to the benchmark.

1. Raw context

This is the simplest possible memory: you dump everything, including the conversation histories and PostgreSQL databases, into memory. The model reads this text dump to answer the question.

This method has obvious issues that don't need a benchmark to understand:

Cost. You resend the whole history on every single question, and that bill grows with every message.
Attention. LLMs reliably read the start and end of a long context and get hazy in the middle.

2. Vector RAG

This is the standard production method today. "RAG" is retrieval-augmented generation: instead of sending everything, you try to retrieve only the relevant bits from memory and send those to the LLM.

You take each message and run it through an embedding model, which turns text into a list of numbers (a vector) that captures its meaning. Similar meanings land near each other in this number space. "Who looks after the Acme account" lands near "the Acme account is handled by Dana", because the model knows "looks after" and "handled by" mean the same thing. You store all these vectors. When a question comes in, you embed the question too, find the handful of stored messages whose vectors are nearest, and send only those to the model.

This is genuinely powerful. It shrugs off wording: ask who "looks after" an account and it finds who it's "handled by." And the cost is flat: you always send the same small handful of messages, no matter how long the history gets.

But notice what it does for the account question. It scores each message against your question on its own. "The Acme account is handled by Dana" looks relevant, because it contains "Acme account." "Dana works out of the Berlin office" looks much less relevant, because your question never mentions Dana. So the second fact, the one you actually need for the office, often doesn't get retrieved.

Standard vector search ranks facts one at a time. It has no way to say "fetch this fact, then follow it to the next one." Better embeddings won't fix this, because the limitation is in how results are ranked, not in how well each message is embedded.

3. Context graph

In a context graph, you stop storing the text directly and instead store the facts extracted from the text as a graph. A graph is nodes connected by edges. Each entity becomes a node, each relationship becomes an edge...

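To make the "follow the edge" idea concrete, here is a minimal sketch of the Acme example stored as a graph and answered with a two-hop traversal. It is an illustration under stated assumptions, not the implementation benchmarked in this post: the `ContextGraph` class, the `follow` method, and the relation names `handled_by` and `works_out_of` are all hypothetical, and the facts are entered by hand rather than extracted from text.

```python
from collections import defaultdict


class ContextGraph:
    """Toy in-memory graph (hypothetical): entities are nodes,
    labelled relationships are edges. One object per (subject, relation)."""

    def __init__(self):
        # subject -> {relation: object}
        self._edges = defaultdict(dict)

    def add_fact(self, subject, relation, obj):
        """Store one fact as a labelled edge from subject to obj."""
        self._edges[subject][relation] = obj

    def follow(self, start, relations):
        """Walk the given relations in order, starting from `start`.

        Returns the node reached after the last hop, or None if any
        hop has no matching edge.
        """
        node = start
        for relation in relations:
            node = self._edges.get(node, {}).get(relation)
            if node is None:
                return None
        return node


graph = ContextGraph()

# The two facts from the example, ingested separately.
graph.add_fact("Acme account", "handled_by", "Dana")
graph.add_fact("Dana", "works_out_of", "Berlin office")

# "Which office handles the Acme account?" becomes two hops:
# Acme account -> Dana -> Berlin office.
answer = graph.follow("Acme account", ["handled_by", "works_out_of"])
print(answer)
```

The second fact never mentions Acme, but that doesn't matter here: it is reached through Dana, not by comparing it with the question. A real system would also need to turn the question into a start node and a sequence of relations, and would usually allow several edges per relation; both are left out of this sketch.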