Layered retrieval beats grep alone for LLM-generated engineering docs

rduffyuk/engineering-memory-benchmark: Empirical study: layered retrieval (typed→semantic→grep) scores 0.954 for LLM-generated engineering artifacts. 5 conditions, 3 model tiers, 36 generated ADRs, 23 score files.

Engineering Memory Benchmark

Don't Choose Your Memory Tool — Layer Them.

An empirical study comparing retrieval methods for LLM-generated engineering artifacts (Architecture Decision Records). It tests 5 retrieval conditions and 3 model tiers on a production K8s engineering platform with 3 months of accumulated engineering history.

Key Finding

Layered retrieval (typed discovery → semantic context → file verification) scores 0.954 on a 5-dimension rubric, beating every individual method:

| Condition | Retrieval method | Mean score | Cost/ADR |
|---|---|---|---|
| A | No memory | 0.572 | ~$1.00 |
| B | Semantic search (Qdrant) | 0.720 | ~$1.50 |
| C | Grep + file read | 0.918 | ~$1.80 |
| D | Typed-fact retrieval only | 0.650 | ~$1.20 |
| E | All three layered | 0.954 | ~$2.50 |

Sonnet + layered retrieval (0.88) comes within 0.03 of Opus + layered (0.91) at one-fifth the cost. Haiku fails on complex topics (0.35) despite rich context, which points to a minimum model capability floor.

Four Findings

1. Retrieval methods compose super-linearly: E > max(B, C, D), because each layer catches errors the others introduce.

2. Semantic search can score below the no-memory baseline: it returns adjacent-but-wrong context that the LLM trusts.

3. Extraction quality is the binding constraint: typed retrieval is only as good as what was extracted.

4. Model matters less than retrieval: Sonnet+E ≈ Opus+E, but Haiku+E fails (the capability floor lies between Haiku and Sonnet).
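The first finding can be checked against the published means. The sketch below uses only the mean scores from the Key Finding table; the function name `composition_gain` is illustrative and not part of the repository. It tests only the inequality the finding states (E against the best single method), not any stronger notion of super-linearity.

```python
# Mean rubric scores per condition, copied from the Key Finding table.
MEAN_SCORES = {"A": 0.572, "B": 0.720, "C": 0.918, "D": 0.650, "E": 0.954}


def composition_gain(scores, layered="E", parts=("B", "C", "D")):
    """Return the best single method and how far the layered condition exceeds it."""
    best_single = max(parts, key=lambda condition: scores[condition])
    return best_single, round(scores[layered] - scores[best_single], 3)


best, gain = composition_gain(MEAN_SCORES)
print(f"best single method: {best}, layered gain: {gain:+.3f}")
# prints "best single method: C, layered gain: +0.036"
```

A positive gain means the layered condition beats every individual method on mean score; here the margin over grep + file read is 0.036.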

Repository Structure

├── PAPER.md                     Full paper (3,700 words)
├── data/
│   ├── ground-truth/            5 real ADRs from production (gold standard)
│   ├── condition-a/             Generated with no memory
│   ├── condition-b/             Generated with semantic search only
│   ├── condition-c/             Generated with grep + file read
│   ├── condition-d/             Generated with typed memory tools only
│   ├── condition-e/             Generated with all three layered (Opus)
│   ├── condition-e-sonnet/      Generated with layered retrieval (Sonnet)
│   └── condition-e-haiku/       Generated with layered retrieval (Haiku)
├── scores/                      23 JSON score files (per-claim decomposition)
├── rubric/
│   └── locked-rubric-v1.md      Immutable scoring rubric (5 dimensions)
├── scripts/
│   └── score_with_gpt4o.py      GPT-4o dual-judge scoring script
├── calibration-manifest.json    15 calibration artifacts
└── LICENSE                      CC-BY-4.0
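To explore the score files locally, something like the following may be enough. It is a minimal sketch that assumes it runs from the repository root and that each file under `scores/` is valid JSON. It makes no assumption about the schema inside each file, and `load_score_files` is a hypothetical helper, not a script shipped with the repository.

```python
import json
from pathlib import Path


def load_score_files(scores_dir="scores"):
    """Load every *.json file in scores_dir into a dict keyed by file name."""
    loaded = {}
    for path in sorted(Path(scores_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            loaded[path.name] = json.load(handle)
    return loaded
```

Calling `len(load_score_files())` from the repository root should then report how many score files were found.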

Methodology

Rubric: 5 dimensions (technical correctness, citation, completeness, conciseness, pattern adoption), locked per RULERS methodology (arXiv 2601.08654)

Judge: Claude Opus 4.7 (primary) + GPT-4o (dual-judge validation, 100% rank agreement on top condition)

Isolation: Each condition runs in a fresh LLM session with only the tools that condition allows

Evidence trail: Every score JSON includes per-claim reasoning explaining why each score was given
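"Rank agreement on top condition" can be read as both judges placing the same condition first. The sketch below illustrates that check under stated assumptions: the first dictionary reuses the means from the Key Finding table (assumed here to be the primary judge's), while the second judge's numbers are invented placeholders, not results from this study. The function names are hypothetical.

```python
def top_condition(mean_scores):
    """Return the condition label with the highest mean score."""
    return max(mean_scores, key=mean_scores.get)


def judges_agree_on_top(primary, secondary):
    """True when both judges rank the same condition first."""
    return top_condition(primary) == top_condition(secondary)


# Means from the Key Finding table (assumed to come from the primary judge).
primary = {"A": 0.572, "B": 0.720, "C": 0.918, "D": 0.650, "E": 0.954}
# HYPOTHETICAL second-judge means, invented purely for illustration.
secondary = {"A": 0.60, "B": 0.70, "C": 0.90, "D": 0.66, "E": 0.93}

print(judges_agree_on_top(primary, secondary))  # prints True
```

Agreement on the top condition is a weaker check than full rank correlation: the judges may still order the remaining conditions differently.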

The 3-Step Workflow (for practitioners)

Step 1: DISCOVERY (typed memory)
"What decisions/problems exist about this topic?"
→ recall_decisions(topic=X), find_problems(topic=X)

Step 2: CONTEXT (semantic search)
"What else is related?"
→ auto_search_vault(query=X)

Step 3: VERIFICATION (file access)
"Do the facts check out against source?"
→ grep + read the actual files
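The three steps above can be sketched as a single function. This is an illustration under assumptions, not the platform's implementation: the tool names `recall_decisions`, `find_problems`, and `auto_search_vault` come from the workflow, but their signatures are unknown, so they are passed in as plain callables that take a topic string and return an iterable of strings. Verification is simplified to a case-insensitive regex search over `*.md` files, which is also an assumption.

```python
import re
from pathlib import Path


def grep_files(pattern, root):
    """Return (file, line number, line) for each regex match under root."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if regex.search(line):
                hits.append((str(path), number, line.strip()))
    return hits


def layered_retrieve(topic, typed_lookups, semantic_search, root):
    """Run discovery, context, then verification, and return all three layers."""
    # Step 1, discovery: typed-memory lookups (e.g. recall_decisions, find_problems).
    facts = [fact for lookup in typed_lookups for fact in lookup(topic)]
    # Step 2, context: semantic search (e.g. auto_search_vault).
    related = list(semantic_search(topic))
    # Step 3, verification: search the source files for the topic itself.
    evidence = grep_files(re.escape(topic), root)
    return {"facts": facts, "related": related, "evidence": evidence}
```

Passing the tools in as callables keeps the sketch independent of any particular memory backend. A fuller version would verify each retrieved fact against the files rather than searching for the topic string alone.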

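Skip layers only for trivial lookups. Going by the table under Key Finding, the full workflow costs about 39% more per ADR than grep alone (~$2.50 vs ~$1.80) and consistently produces better output (0.954 vs 0.918 mean score).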
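The trade-off between layered retrieval (E) and grep alone (C) can be worked out from the approximate per-ADR costs and mean scores in the Key Finding table. The sketch below uses only those rounded table values; `relative_change` is an illustrative helper.

```python
# Approximate per-ADR cost (USD) and mean score, from the Key Finding table.
CONDITIONS = {
    "C": {"cost": 1.80, "score": 0.918},  # grep + file read
    "E": {"cost": 2.50, "score": 0.954},  # all three layered
}


def relative_change(new, old):
    """Percentage change from old to new."""
    return (new - old) / old * 100


cost_pct = relative_change(CONDITIONS["E"]["cost"], CONDITIONS["C"]["cost"])
score_pct = relative_change(CONDITIONS["E"]["score"], CONDITIONS["C"]["score"])
print(f"cost: {cost_pct:+.1f}%, score: {score_pct:+.1f}%")
# prints "cost: +38.9%, score: +3.9%"
```

Because the table costs are rounded approximations, the percentages are indicative rather than exact.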

Platform

Built on Rootweaver — a typed engineering-memory platform running on single-node K3s (RTX 4080). 248 sessions, 2,748 typed facts, 6,135 artifacts, 376 v2-quality enriched facts across 3 months of real engineering...
