Digesting a codebase before a model reads it – Matthew Johnston – Retail Data Science and Biology, PhD
Across every organisation I’ve worked in, documentation is either missing or, once written, out of date. So, we’ve stopped treating it as something people maintain and made agents regenerate it on every code change.
TL;DR
estate-wiki is an internal, self-updating wiki for the Jollyes backend estate: one page per Bitbucket repo, regenerated by a scheduled agent that re-reads the code on a cadence, skipping repos that haven’t changed. The docs are a generated artefact, so they cannot rot.
Each repo read produces two views from one pass: a README human view and a CLAUDE.md machine view, rendered together with a structured facts blob that agents consume via an MCP.
To help produce better outputs, key files that are too big, convoluted, or unsafe to feed in raw (such as Airflow DAGs or SSIS packages) are digested deterministically first.
There’s a fun symmetry between each file being pre-digested into an information-rich summary and the project as a whole condensing the estate into a summary other agents query: like Russian dolls of information.
The stack
estate-wiki runs a single, provider-agnostic review agent against the Jollyes Bitbucket estate inside an ephemeral ECS task. The model call sits behind a neutral interface, so the same agent runs on OpenAI, Anthropic, or other providers. It auto-discovers repos from the workspace, so the live wiki now covers a few hundred repos, and a new repo is picked up with no config edit.
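To make the "neutral interface" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `ModelClient`, `EchoClient`, `PROVIDERS` and `summarise` are hypothetical names, and the stand-in client just echoes its input where a real adapter would wrap a vendor SDK.

```python
from dataclasses import dataclass
from typing import Protocol


class ModelClient(Protocol):
    """Hypothetical neutral interface: anything with a complete() method."""

    def complete(self, system: str, prompt: str) -> str: ...


@dataclass
class EchoClient:
    """Stand-in provider for local runs; a real adapter would call a vendor SDK."""

    name: str

    def complete(self, system: str, prompt: str) -> str:
        # Echo a prefix of the prompt so the example runs without network access.
        return f"[{self.name}] {prompt[:40]}"


# Hypothetical registry keyed by a config value such as "openai" or "anthropic".
PROVIDERS: dict[str, ModelClient] = {
    "openai": EchoClient("openai"),
    "anthropic": EchoClient("anthropic"),
}


def summarise(repo_context: str, provider: str) -> str:
    """The agent only ever talks to the interface, never to a vendor SDK."""
    client = PROVIDERS[provider]
    return client.complete("Summarise this repository.", repo_context)


if __name__ == "__main__":
    print(summarise("hello", "openai"))
```

Swapping provider then becomes a config change rather than a code change, which is the property the paragraph above relies on.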
Airflow DAG
ECS Fargate task (review agent, service token)
│ clone → HEAD → skip-if-unchanged?
scope to git-tracked files ──► digesters (.dtsx, DAGs)
ONE summarise pass ──► facts JSON + human view + machine view
│ (model chosen by config: OpenAI · Anthropic · Bedrock)
backend REST /api/private/* ──► Postgres (one row per repo)
└── read back by AI agents over the MCP
The agent never touches the database directly. It writes through /api/private/* with a service token, so the backend stays the single Postgres writer (and reader). The digesters box is the step I want to dwell on: it runs before the model and decides what the model is given to read.
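The skip-if-unchanged step can be sketched as a comparison between each repo's current HEAD and the commit recorded at the last successful run. This is a sketch under assumptions: the source only says "clone → HEAD → skip-if-unchanged", so comparing commit SHAs, and the names `should_regenerate` and `plan_run`, are my own illustration.

```python
from typing import Optional


def should_regenerate(head_sha: str, stored_sha: Optional[str]) -> bool:
    """Regenerate when the repo is new (no stored SHA) or HEAD has moved."""
    return stored_sha is None or stored_sha != head_sha


def plan_run(heads: dict[str, str], stored: dict[str, str]) -> list[str]:
    """Return the repos that need a fresh read, in a stable order.

    heads:  repo name -> current HEAD SHA (assumed to come from discovery)
    stored: repo name -> SHA recorded with the last generated page
    """
    return sorted(
        repo for repo, sha in heads.items()
        if should_regenerate(sha, stored.get(repo))
    )


if __name__ == "__main__":
    heads = {"a": "1", "b": "2", "c": "3"}
    stored = {"a": "1", "b": "0"}  # "c" has never been documented
    print(plan_run(heads, stored))
```

A newly discovered repo has no stored SHA, so it falls into the run without any config edit.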
Digesting files before reading them
Not every file is source code you can hand directly to a model. For both SSIS packages and Airflow DAGs, we have deterministic digesters that run before any LLM call. The model never sees these files raw, which is both safer and gives it better context.
For example, SSIS (.dtsx) packages are often huge XML documents (a single MAIN.dtsx can be ~800 KB) and may contain encrypted secrets. Passing the raw XML to a model would be slow and expensive, and could expose credentials. dtsxDigest parses the package into a compact, secret-masked JSON representation containing:
Connection managers (the actual source and destination systems)
Data-flow components in execution order
SQL executed by each step
The result reads like a concise “source → transform → destination” pipeline rather than hundreds of kilobytes of XML markup. Interestingly, even though the model can easily fit the entire file into context and the transformation is a simple deterministic script, runs with a digested file produce significantly better output. It seems the digest structures the data better for the model than the raw XML does: it organises the logic semantically, surfacing the key flows the documentation needs to capture.
I find it an interesting principle to think over: use cheap, fast and deterministic parsing to decide exactly what information reaches the model, and in what shape. The expensive LLM step only receives a clean, safe, structured, information-dense representation.
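As an illustration of that principle, here is a minimal Python sketch of a digest step. It is not dtsxDigest itself: the XML below is a deliberately simplified stand-in rather than the real .dtsx schema, the element and function names are hypothetical, and "execution order" is assumed to be document order.

```python
import json
import re
import xml.etree.ElementTree as ET

# Assumption: secrets appear as Password=... or Pwd=... in connection strings.
SECRET = re.compile(r"\b(password|pwd)\s*=\s*[^;]*", re.IGNORECASE)

# Simplified stand-in for a package; real .dtsx files are far larger.
SAMPLE = """
<Package name="MAIN">
  <Connection name="Source" connectionString="Server=db1;User ID=etl;Password=hunter2;"/>
  <Connection name="Warehouse" connectionString="Server=dw1;Integrated Security=SSPI;"/>
  <Step name="Load stock" type="ExecuteSQL">
    INSERT INTO dw.stock
    SELECT * FROM staging.stock
  </Step>
</Package>
"""


def mask_secrets(connection_string: str) -> str:
    """Replace secret values with *** while keeping the key visible."""
    return SECRET.sub(lambda m: f"{m.group(1)}=***", connection_string)


def digest_package(xml_text: str) -> dict:
    """Reduce the package to connections, ordered steps, and their SQL."""
    root = ET.fromstring(xml_text.strip())
    connections = [
        {"name": c.get("name"),
         "connection": mask_secrets(c.get("connectionString", ""))}
        for c in root.iter("Connection")
    ]
    steps = [
        {"name": s.get("name"),
         "type": s.get("type"),
         # Collapse whitespace so multi-line SQL becomes one compact line.
         "sql": " ".join((s.text or "").split())}
        for s in root.iter("Step")  # document order, assumed to match execution
    ]
    return {"package": root.get("name"), "connections": connections, "steps": steps}


if __name__ == "__main__":
    print(json.dumps(digest_package(SAMPLE), indent=2))
```

The masking happens before anything is serialised, so no later step, including the model call, can see the original credential.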
Two views, one source
Each repo is one Postgres row holding both rendered views plus a facts JSONB. The agent extracts the facts and renders both markdowns in a single pass: a README-flavoured human view for the helpdesk and new starters, and a CLAUDE.md-flavoured machine view for developers and AI agents.
The facts blob does three jobs: it seeds both views, it gives the Q&A agent cheap grounding so it needn’t re-read a whole repo per question, and it’s machine-consumable over the MCP. Fields include languages, endpoints, env vars, data stores, integrations, deploy target, owners, and key files, plus category-specific dags[] and ssisPackages[].
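One way to picture the facts blob is as a small typed record that serialises to JSON. The sketch below is an assumption throughout: only dags and ssisPackages are named verbatim above, so the other field names, their types, and the `RepoFacts` and `grounding_prompt` names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RepoFacts:
    """Hypothetical shape for the facts blob; field names other than
    dags and ssisPackages are guesses based on the prose."""

    languages: list[str] = field(default_factory=list)
    endpoints: list[str] = field(default_factory=list)
    envVars: list[str] = field(default_factory=list)
    dataStores: list[str] = field(default_factory=list)
    integrations: list[str] = field(default_factory=list)
    deployTarget: str = ""
    owners: list[str] = field(default_factory=list)
    keyFiles: list[str] = field(default_factory=list)
    dags: list[dict] = field(default_factory=list)
    ssisPackages: list[dict] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise with sorted keys so the stored blob diffs cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


def grounding_prompt(facts: RepoFacts, question: str) -> str:
    """Build a Q&A prompt from the facts alone, without re-reading the repo."""
    return (
        "Answer using only these repository facts.\n"
        f"FACTS: {facts.to_json()}\n"
        f"QUESTION: {question}"
    )


if __name__ == "__main__":
    facts = RepoFacts(languages=["python"], dags=[{"dagId": "stock_sync"}])
    print(grounding_prompt(facts, "Which DAGs does this repo define?"))
```

Because the same record feeds both the rendered views and the Q&A prompt, there is one place for a fact to be wrong rather than three.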
The same idea extends across the estate: the whole estate condenses into one summary that other agents call over the MCP. Files digested for the model, nested inside an estate digested for other agents.
An example: stock, end to end
The wiki has a built-in Q&A per page, where the agent responds to user questions from the facts blob. Conceptually that’s simple, as the information is self-contained.
A much harder question is one that spans the whole estate, and here the wiki MCP comes into its own. I asked Claude Code: how does a “linked” pack/single SKU work, end to end?...