How to benchmark persistent repo memory for coding agents

Benchmarking Greplica: Significant uplift on planning tasks on open-source repositories

43% less cost: average reduction across the selected top 10 tasks

49% fewer tokens: less context spent versus baseline exploration

36% fewer tool calls: fewer repository exploration steps before planning

26% time saved: less elapsed planning time on held-out tasks

Contents

1. Overview
2. Why agents need memory
3. What Greplica does
4. Benchmark design
5. Results
6. Conclusion

Overview

Greplica improves coding-agent performance on complex engineering tasks by giving agents access to relevant memory from prior development sessions.

We benchmarked Greplica using the SWE-chat dataset on 10 selected high-context tasks across open-source repositories, and found that agents with Greplica memory consistently reached plans with less exploration than baseline agents that started from scratch.

Agents using Greplica performed better on all counts:

43% lower estimated cost

49% fewer tokens consumed

36% fewer tool calls

26% less time taken
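Each figure above is a relative reduction against the baseline agent. A minimal sketch of that arithmetic follows. It assumes the average is the mean of per-task reductions, which this post does not spell out; the function names and the sample numbers are hypothetical and are not benchmark data.

```python
from statistics import mean


def percent_reduction(baseline: float, with_memory: float) -> float:
    """Relative reduction of a metric versus the baseline run, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - with_memory) / baseline


def average_reduction(pairs: list[tuple[float, float]]) -> float:
    """Mean of per-task reductions; each pair is (baseline, with_memory).

    Assumption: tasks are weighted equally. Dividing summed totals instead
    would weight expensive tasks more heavily and give a different number.
    """
    return mean(percent_reduction(b, m) for b, m in pairs)


# Hypothetical tool-call counts for three tasks, for illustration only.
example_tool_calls = [(40, 30), (50, 25), (20, 18)]
print(round(average_reduction(example_tool_calls), 1))  # prints 28.3
```

The same function applies to cost, tokens, and elapsed time, since each is a "lower is better" quantity measured once per task for both agents.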

When relevant context is saved in memory and surfaced to the agent during a related task, the agent understands the task better, finds the right subsystems sooner, and accounts for prior decisions. These effects add up to concentrated gains when producing an implementation plan.

In this post we walk through how Greplica helps agents, how we designed the benchmark, what we measured, and what the pilot results show.

Why Coding Agents Need Memory

Coding agents are reasoning systems built around LLMs. When a new session starts, their context window contains only the user prompt, global skills, and AGENTS.md. From there they must rebuild an understanding of the codebase through tool calls: grep, glob, read, shell commands, and file inspection. Large repositories can contain millions of lines of code, so meaningful time and tokens are lost reconstructing context that may already have been learned in previous sessions.

A larger context window does not automatically solve this. Too much irrelevant context can make the agent slower, more expensive, and less accurate. When the window fills up, harnesses compact the conversation and useful intermediate reasoning can be lost.

Developers compensate by giving project instructions in prompts, or by writing them into AGENTS.md or other repo-level documentation. These are useful, but they are hard to keep current and are not designed for task-specific retrieval. As the project grows, they become either too sparse to help or too large to trust.

What coding agents need is not just more context. They need persistent, queryable engineering memory.

What Greplica Does

Greplica works in the background, looking out for important bits of context to capture. It uses your coding session transcripts and fresh code changes to extract useful facts such as architectural decisions, learnings from prior attempts, gotchas, and edge cases. These are stored automatically in a persistent SQLite-backed graph at the end of each session.

When an agent receives a new task, it can query Greplica before broad manual exploration. Instead of rediscovering the repository from scratch, it retrieves relevant prior context and uses that to produce a better plan.
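To make the store-then-recall flow concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema (a facts table plus an edges table), the function names, and the keyword lookup are all illustrative assumptions. This post does not describe Greplica's actual schema or retrieval method.

```python
import sqlite3


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open a store with a hypothetical two-table graph schema."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS facts (
            id INTEGER PRIMARY KEY,
            kind TEXT NOT NULL,        -- e.g. 'decision', 'gotcha'
            body TEXT NOT NULL,
            session_id TEXT NOT NULL   -- session the fact came from
        );
        CREATE TABLE IF NOT EXISTS edges (
            src INTEGER NOT NULL REFERENCES facts(id),
            dst INTEGER NOT NULL REFERENCES facts(id),
            relation TEXT NOT NULL
        );
        """
    )
    return conn


def add_fact(conn: sqlite3.Connection, kind: str, body: str, session_id: str) -> int:
    """Insert one fact node and return its id."""
    cur = conn.execute(
        "INSERT INTO facts (kind, body, session_id) VALUES (?, ?, ?)",
        (kind, body, session_id),
    )
    conn.commit()
    return cur.lastrowid


def link(conn: sqlite3.Connection, src: int, dst: int, relation: str) -> None:
    """Insert one directed edge between two facts."""
    conn.execute(
        "INSERT INTO edges (src, dst, relation) VALUES (?, ?, ?)",
        (src, dst, relation),
    )
    conn.commit()


def recall(conn: sqlite3.Connection, keyword: str) -> list[tuple[str, str]]:
    """Return facts whose body matches the keyword, plus facts they link to."""
    like = f"%{keyword}%"
    rows = conn.execute(
        """
        SELECT f.id, f.kind, f.body FROM facts f WHERE f.body LIKE ?
        UNION
        SELECT n.id, n.kind, n.body
        FROM facts f
        JOIN edges e ON e.src = f.id
        JOIN facts n ON n.id = e.dst
        WHERE f.body LIKE ?
        ORDER BY 1
        """,
        (like, like),
    ).fetchall()
    return [(kind, body) for _, kind, body in rows]


# Made-up facts, for illustration only.
conn = open_memory()
a = add_fact(conn, "decision", "Auth tokens are cached in the session store", "s1")
b = add_fact(conn, "gotcha", "Cache invalidation must run before migrations", "s2")
link(conn, a, b, "related_to")
for kind, body in recall(conn, "auth"):
    print(kind, "|", body)
```

Following one hop of edges is what makes this a graph lookup and not a flat search: the query for "auth" also returns the linked gotcha, even though its text never mentions auth.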

We designed this benchmark to test whether that works on realistic, temporally valid session sequences.

Benchmark Design

We started with a specific question:

If a coding agent has access to memory built from prior related sessions on the same repository, does it produce a better plan for a later task — faster and with less exploration?

Why planning, not implementation

We chose the planning phase because most of an agent's initial exploration is spent understanding the repo, locating the right subsystem, and turning that context into a plan.

Data source

Cases are built from the SALT-NLP/SWE-chat dataset: real developer sessions with transcripts, checkpoints, and edit patches across many open-source repos.

Each case is a sequence of coding sessions:

Prior (memory-building) sessions (2-4) — chronologically before the session chosen for testing. Memory is built only from these.

Held-out (test) session — a later session on the same repo. Its main engineering task becomes the benchmark prompt. The agent never sees this transcript during memory build.

We built memory only from the prior sessions and verified that no later session, including the held-out one, leaked into it.
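The temporal split above can be sketched as follows. The Session fields, the function names, and the choice to keep the most recent prior sessions are assumptions made for illustration. They do not reflect the SWE-chat dataset's actual schema or the exact selection rule used in this benchmark.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Session:
    session_id: str       # hypothetical field names
    repo: str
    started_at: datetime


@dataclass(frozen=True)
class Case:
    prior: tuple[Session, ...]   # memory is built only from these
    held_out: Session            # its task becomes the benchmark prompt


def build_case(
    sessions: list[Session],
    held_out_id: str,
    min_prior: int = 2,
    max_prior: int = 4,
) -> Case:
    """Pair a held-out session with 2-4 earlier sessions on the same repo."""
    by_id = {s.session_id: s for s in sessions}
    held_out = by_id[held_out_id]
    earlier = sorted(
        (
            s
            for s in sessions
            if s.repo == held_out.repo and s.started_at < held_out.started_at
        ),
        key=lambda s: s.started_at,
    )
    # Assumption: keep the sessions closest in time to the held-out one.
    prior = tuple(earlier[-max_prior:])
    if len(prior) < min_prior:
        raise ValueError(f"need at least {min_prior} prior sessions, got {len(prior)}")
    return Case(prior=prior, held_out=held_out)


def assert_no_leak(case: Case) -> None:
    """Fail if any memory-building session is not strictly before the test session."""
    for s in case.prior:
        if s.session_id == case.held_out.session_id:
            raise AssertionError("held-out session found in memory-building set")
        if s.started_at >= case.held_out.started_at:
            raise AssertionError(f"session {s.session_id} is not before the held-out session")
```

Running the leak check as a separate step, after the case is built, guards against later changes to the selection logic quietly breaking the temporal guarantee.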

Repository and task selection

We first shortlisted repositories by credibility (number of GitHub stars), history (number of past commits), and continuity (multiple contiguous sessions on related work).

From those, we chose 10 sessions where the user was doing highly contextual work: work related to prior sessions, or tasks that required subsystem understanding rather than a one-file fix.

These tasks mimic real-world development tasks in large, complex repositories.

Task construction

For each chosen session, we inspected the work that happened in it and constructed a prompt for a planning task, ...
