rduffyuk/engineering-memory-benchmark: Empirical study: layered retrieval (typed→semantic→grep) scores 0.954 for LLM-generated engineering artifacts. 5 conditions, 3 model tiers, 36 generated ADRs, 23 score files.
Engineering Memory Benchmark
Don't Choose Your Memory Tool — Layer Them.
An empirical study comparing retrieval methods for LLM-generated engineering artifacts (Architecture Decision Records). It tests 5 retrieval conditions and 3 model tiers on a production K8s engineering platform with 3 months of accumulated engineering history.
Key Finding
Layered retrieval (typed discovery → semantic context → file verification) scores 0.954 on a 5-dimension rubric, beating every individual method:
| Condition | Mean Score | Cost/ADR |
|---|---|---|
| A: No memory | 0.572 | ~$1.00 |
| B: Semantic search (Qdrant) | 0.720 | ~$1.50 |
| C: Grep + file read | 0.918 | ~$1.80 |
| D: Typed-fact retrieval only | 0.650 | ~$1.20 |
| E: All three layered | 0.954 | ~$2.50 |
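To make the score/cost trade-off in the table easier to read, here is a small sketch that computes each condition's gain over the no-memory baseline. The numbers are copied from the table; the function name `gain_over_baseline` is only illustrative and is not part of this repository.

```python
# Mean score and approximate cost per ADR, copied from the results table.
RESULTS = {
    "A": (0.572, 1.00),  # no memory (baseline)
    "B": (0.720, 1.50),  # semantic search
    "C": (0.918, 1.80),  # grep + file read
    "D": (0.650, 1.20),  # typed-fact retrieval only
    "E": (0.954, 2.50),  # all three layered
}


def gain_over_baseline(results, baseline="A"):
    """Return {condition: (score gain, extra cost)} relative to the baseline."""
    base_score, base_cost = results[baseline]
    gains = {}
    for name, (score, cost) in results.items():
        if name == baseline:
            continue
        gains[name] = (round(score - base_score, 3), round(cost - base_cost, 2))
    return gains


gains = gain_over_baseline(RESULTS)
for name, (d_score, d_cost) in gains.items():
    print(f"{name}: +{d_score:.3f} score for +${d_cost:.2f} per ADR")
```

Because the costs in the table are approximate, the extra-cost figures should be read as rough estimates.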
Sonnet + layered retrieval (0.88) nearly matches Opus + layered retrieval (0.91) at roughly one-fifth the cost. Haiku fails on complex topics (0.35) despite rich context, which points to a minimum model capability floor.
Four Findings
1. Retrieval methods compose super-linearly: E > max(B, C, D), because each layer catches errors the others introduce.
2. Semantic search can score below the no-memory baseline: it returns adjacent-but-wrong context that the LLM trusts.
3. Extraction quality is the binding constraint: typed retrieval is only as good as what was extracted.
4. Model matters less than retrieval: Sonnet+E ≈ Opus+E, but Haiku+E fails (the capability floor lies between Haiku and Sonnet).
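The inequality in the first finding can be checked directly against the reported means. The sketch below uses only the mean scores from the results table; `best_single` is a hypothetical helper name.

```python
# Mean rubric scores per condition, as reported in the results table.
MEAN_SCORES = {"A": 0.572, "B": 0.720, "C": 0.918, "D": 0.650, "E": 0.954}
SINGLE_METHODS = ("B", "C", "D")  # the individual retrieval methods


def best_single(scores, singles=SINGLE_METHODS):
    """Return the name of the highest-scoring individual method."""
    return max(singles, key=lambda name: scores[name])


best = best_single(MEAN_SCORES)
margin = round(MEAN_SCORES["E"] - MEAN_SCORES[best], 3)
print(f"best single method: {best}, layered margin: {margin:+.3f}")
```

This only confirms the stated inequality E > max(B, C, D) on the mean scores. It says nothing about per-topic behaviour, for which the per-claim score files are the place to look.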
Repository Structure
├── PAPER.md                     Full paper (3,700 words)
├── data/
│   ├── ground-truth/            5 real ADRs from production (gold standard)
│   ├── condition-a/             Generated with no memory
│   ├── condition-b/             Generated with semantic search only
│   ├── condition-c/             Generated with grep + file read
│   ├── condition-d/             Generated with typed memory tools only
│   ├── condition-e/             Generated with all three layered (Opus)
│   ├── condition-e-sonnet/      Generated with layered retrieval (Sonnet)
│   └── condition-e-haiku/       Generated with layered retrieval (Haiku)
├── scores/                      23 JSON score files (per-claim decomposition)
├── rubric/
│   └── locked-rubric-v1.md      Immutable scoring rubric (5 dimensions)
├── scripts/
│   └── score_with_gpt4o.py      GPT-4o dual-judge scoring script
├── calibration-manifest.json    15 calibration artifacts
└── LICENSE                      CC-BY-4.0
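For readers who want to inspect the score files, a minimal loading sketch follows. It assumes only that `scores/` contains `*.json` files, as listed in the structure above. It makes no assumption about the fields inside each file, and `load_score_files` is a hypothetical helper, not a script shipped in this repository.

```python
import json
from pathlib import Path


def load_score_files(scores_dir):
    """Return {file stem: parsed JSON} for every *.json file in scores_dir."""
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(scores_dir).glob("*.json"))
    }
```

From a checkout of the repository, `load_score_files("scores")` would return one entry per score file, which can then be explored interactively.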
Methodology
Rubric: 5 dimensions (technical correctness, citation, completeness, conciseness, pattern adoption), locked per RULERS methodology (arXiv 2601.08654)
Judge: Claude Opus 4.7 (primary) + GPT-4o (dual-judge validation, 100% rank agreement on top condition)
Isolation: Each condition runs in a fresh LLM session with only the tools that condition allows
Evidence trail: Every score JSON includes per-claim reasoning explaining why each score was given
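The isolation rule can be pictured as a per-condition tool allowlist. The sketch below is one possible way to express it and is not the harness used in the study. The typed and semantic tool names come from the workflow section of this README, while `grep` and `read_file` are placeholder labels for file access.

```python
# Tool groups per retrieval layer.
TYPED = frozenset({"recall_decisions", "find_problems"})
SEMANTIC = frozenset({"auto_search_vault"})
FILES = frozenset({"grep", "read_file"})  # placeholder names (assumption)

# Which tools each experimental condition may use.
ALLOWED_TOOLS = {
    "A": frozenset(),                 # no memory
    "B": SEMANTIC,                    # semantic search only
    "C": FILES,                       # grep + file read
    "D": TYPED,                       # typed-fact retrieval only
    "E": TYPED | SEMANTIC | FILES,    # all three layered
}


def check_tool_call(condition, tool):
    """Return the tool name if the condition allows it, else raise."""
    if tool not in ALLOWED_TOOLS[condition]:
        raise PermissionError(f"condition {condition} may not call {tool}")
    return tool
```

A guard like this would reject, for example, a `grep` call made during condition B, so that each condition's score reflects only its own retrieval layer.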
The 3-Step Workflow (for practitioners)
Step 1: DISCOVERY (typed memory)
"What decisions/problems exist about this topic?"
→ recall_decisions(topic=X), find_problems(topic=X)

Step 2: CONTEXT (semantic search)
"What else is related?"
→ auto_search_vault(query=X)

Step 3: VERIFICATION (file access)
"Do the facts check out against source?"
→ grep + read the actual files
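The three steps can be sketched as a small pipeline. This is an illustrative outline under stated assumptions, not the platform's implementation. The tool names `recall_decisions`, `find_problems` and `auto_search_vault` come from the steps above, but their bodies here are toy stubs, and the `Fact` type, the in-memory `FILES` dict and the substring-based `verify` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    text: str         # the claim to be checked
    source_path: str  # file the claim is supposed to come from


def layered_retrieve(topic, typed_lookups, semantic_search, verify):
    """Collect candidates from typed and semantic layers, keep verified ones."""
    # Step 1: discovery via typed memory.
    candidates = [fact for lookup in typed_lookups for fact in lookup(topic=topic)]
    # Step 2: related context via semantic search.
    candidates += list(semantic_search(query=topic))
    # Step 3: drop duplicates and anything that fails the source check.
    seen, verified = set(), []
    for fact in candidates:
        if fact in seen:
            continue
        seen.add(fact)
        if verify(fact):
            verified.append(fact)
    return verified


# Toy stand-ins for the real tools and files (illustrative data only).
FILES = {"adr/0001.md": "We chose K3s for the single-node cluster."}


def recall_decisions(topic):
    return [Fact("chose K3s", "adr/0001.md")]


def find_problems(topic):
    return []


def auto_search_vault(query):
    # Semantic search may return adjacent-but-wrong context.
    return [Fact("chose Nomad", "adr/0001.md"), Fact("chose K3s", "adr/0001.md")]


def verify(fact):
    # Stand-in for "grep + read the actual files".
    return fact.text in FILES.get(fact.source_path, "")


result = layered_retrieve(
    "orchestrator", [recall_decisions, find_problems], auto_search_vault, verify
)
print(result)
```

In this toy run the semantic layer contributes one wrong fact and one duplicate. The verification step discards the wrong one, which is the error-catching behaviour the layered workflow relies on.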
Skip layers only for trivial lookups. The full workflow costs more than grep alone (about $2.50 versus $1.80 per ADR in the results table above) but consistently produces better output.
Platform
Built on Rootweaver — a typed engineering-memory platform running on single-node K3s (RTX 4080). 248 sessions, 2,748 typed facts, 6,135 artifacts, 376 v2-quality enriched facts across 3 months of real engineering...