GitHub - arsenis-cmd/clai-benchmarks: Governed continual-learning memory for AI agents — rejects poisoned facts, derives unstored relationships. Reproducible head-to-head benchmarks. · GitHub
CLAI Benchmarks
Memory that rejects what it shouldn't learn, and derives what was never stored.
Most agent memory just stores and retrieves. CLAI vets what enters (governance) and composes relationships that exist in no single document (derivation): two things retrieval-first memory can't do at the write path.
The gap, in one line: a store-everything memory admits 7 / 7 poisoned facts; CLAI admits 0 / 7. On multi-hop questions where an exact-match knowledge graph splits into dead-ends (0 / 3), CLAI derives the answer (3 / 3). Both results are reproducible below, and the baseline side runs live.
This repository holds two reproducible head-to-head benchmarks. The baseline side of each runs locally with no dependencies, so you can see the gap yourself. The CLAI engine is proprietary; these demos call it as a black box, and the recorded CLAI results (JSON + tables) are committed so the comparison is complete even without engine access.
Engine / early access — waitlist: https://clai-three.vercel.app
Result 1 — Governance: retrieval ≠ governance
Feed the same 26-fact knowledge base (7 poisoned), in the same order, to a generic store-everything memory and to CLAI's governed admission.
The store-everything memory admits 7 / 7 poisoned facts and keeps no audit trail.
CLAI keeps 0 / 7 poison in memory, with 0 / 7 clean-fact over-rejection and a per-fact reason.
Downstream: CLAI returns the verified value on 5 / 5 probes, because the poison was never stored.
→ Full methodology + honest notes
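To make the contrast concrete, here is a minimal sketch of a store-everything write path next to a governed one. It is not the CLAI engine, which is proprietary. The `Fact` fields, the `source` label, the trusted-source rule, and the contradiction rule are all hypothetical stand-ins chosen for illustration, and the three facts are made up rather than taken from the 26-fact knowledge base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str
    source: str  # hypothetical provenance label, not a field from the benchmark


class StoreEverythingMemory:
    """Baseline: every write is admitted, last write wins, no audit trail."""

    def __init__(self):
        self.facts = {}

    def write(self, fact):
        self.facts[(fact.subject, fact.attribute)] = fact.value
        return True


class GovernedMemory:
    """Toy governed write path: check each fact before storing and log a reason."""

    def __init__(self, trusted_sources):
        self.trusted = set(trusted_sources)
        self.facts = {}
        self.audit = []  # one (fact, decision, reason) entry per write attempt

    def write(self, fact):
        key = (fact.subject, fact.attribute)
        if fact.source not in self.trusted:
            decision, reason = "rejected", "untrusted source"
        elif key in self.facts and self.facts[key] != fact.value:
            decision, reason = "rejected", "contradicts stored value"
        else:
            self.facts[key] = fact.value
            decision, reason = "admitted", "passed checks"
        self.audit.append((fact, decision, reason))
        return decision == "admitted"


# Illustrative stream: the second fact plays the role of a poisoned write.
stream = [
    Fact("Acme", "hq_city", "Berlin", "registry"),
    Fact("Acme", "hq_city", "Atlantis", "anonymous_post"),
    Fact("Acme", "ceo", "J. Doe", "registry"),
]

naive = StoreEverythingMemory()
governed = GovernedMemory(trusted_sources={"registry"})
for fact in stream:
    naive.write(fact)
    governed.write(fact)

print("store-everything:", naive.facts[("Acme", "hq_city")])
print("governed:        ", governed.facts[("Acme", "hq_city")])
for fact, decision, reason in governed.audit:
    print(f"{decision:8} {fact.attribute}={fact.value!r} ({reason})")
```

In this toy run the baseline ends up holding the poisoned value because the last write wins, while the governed memory still holds the first value and records why the second write was refused. The real admission logic may work quite differently; the point is only that the check happens at write time.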
Result 2 — Derivation: derived, not extracted
Six 2-hop chains (person → company → city). On the hard chains the linking entity is mentioned two ways, e.g. "Orion Biotech" in one sentence and "Orion Biotechnology Incorporated" in the next.
An exact-match knowledge graph keys those as two nodes → the path splits → multi-hop dead-ends (hard 0 / 3). It still multi-hops fine on the controls (3 / 3).
CLAI resolves the variants to one entity and composes the answer across the gap (hard 3 / 3, controls 3 / 3). Every answer is a real 2-hop derivation: the direct person → city edge is never stored.
→ Full methodology + honest notes
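The dead-end described above can be reproduced in a few lines. This is a sketch under stated assumptions, not the benchmark script and not CLAI's resolution method. The person "Ada" and the city "Lyon" are invented for illustration, and `normalize` is a deliberately crude, hypothetical alias resolver (lowercasing, dropping corporate suffixes, one hand-written abbreviation).

```python
import re

# One hard chain as (subject, relation, object) triples. The company name is
# written two ways, as in the README example; person and city are made up.
TRIPLES = [
    ("Ada", "works_at", "Orion Biotech"),
    ("Orion Biotechnology Incorporated", "based_in", "Lyon"),
]

# Hypothetical normalization tables, just enough for this example.
SUFFIXES = {"inc", "incorporated", "corp", "corporation", "ltd", "llc"}
ABBREVIATIONS = {"biotechnology": "biotech"}


def exact(name):
    """Exact-match keying: the surface string is the node id."""
    return name


def normalize(name):
    """Toy alias resolution: lowercase, drop suffixes, apply abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens if t not in SUFFIXES]
    return " ".join(tokens)


def build_graph(triples, key):
    """Adjacency map: node id -> {relation: node id}."""
    graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(key(subj), {})[rel] = key(obj)
    return graph


def two_hop(graph, start, rel1, rel2, key):
    """Follow start -rel1-> mid -rel2-> end; return None on a dead-end."""
    mid = graph.get(key(start), {}).get(rel1)
    if mid is None:
        return None
    return graph.get(mid, {}).get(rel2)


exact_graph = build_graph(TRIPLES, exact)
resolved_graph = build_graph(TRIPLES, normalize)

exact_answer = two_hop(exact_graph, "Ada", "works_at", "based_in", exact)
resolved_answer = two_hop(resolved_graph, "Ada", "works_at", "based_in", normalize)

print("exact-match graph:", exact_answer)
print("resolved graph:   ", resolved_answer)
```

With exact keys the first hop lands on "Orion Biotech", which has no outgoing `based_in` edge, so the query returns `None`. Once both spellings map to the same node the second hop succeeds. Neither graph ever stores a direct person → city edge.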
Run it yourself
The baselines are pure Python (3.9+, standard library only), so no install is needed:
git clone https://github.com/arsenis-cmd/clai-benchmarks && cd clai-benchmarks
python3 governance/run_governance.py   # store-everything admits 7/7 poison, live
python3 derivation/run_derivation.py   # exact-match graph dead-ends on hard chains, live
Each script runs the baseline live and prints the recorded CLAI column next to it. The CLAI side is a black-box call into clai_engine; since the engine isn't in this public repo, it prints a clear "request access" message and points at the committed results in each results/ folder.
Regenerate the result images (optional, since they're already committed):
pip install -r requirements.txt   # matplotlib, only for rendering
python3 governance/make_artifacts.py
python3 derivation/make_artifacts.py
Honest scope (the things a sharp reader will ask)
n is small and illustrative: governance n = 26, derivation n = 6 chains. These are clean demonstrations of a mechanism, not leaderboard benchmarks. Larger automated benchmarks are the obvious follow-up.
Governance is architectural / LLM-independent: the gap is about having a governed write path at all, not about which model sits behind it.
The derivation gap is specifically the...