GitHub - raphaelwkago-sketch/rudi: Causal graph memory for LLMs - flat token cost regardless of session length. · GitHub
Rudi
Causal graph memory for LLMs. Flat token cost, no matter how long the session runs.
Every LLM API call re-sends the whole conversation. Cost grows every turn; eventually you hit the context limit. Rudi replaces the growing transcript with a dependency graph of decisions — and injects only the slice relevant to the current task. Turn 10,000 costs about the same as turn 10.
The 30-second version
In a 43-turn software-architecture session (building a Notes API turn by turn), the standard "re-send the full transcript" approach was sending ~38,000 input tokens by the final turn. Rudi sent 6,782 — for the same task, same model, same answer quality.
| Turn | Rudi input | Full-transcript input | Savings |
| --- | --- | --- | --- |
| | 382 | 340 | |
| 10 | 1,467 | 6,999 | 4.8× |
| 20 | 3,581 | 17,385 | 4.9× |
| 30 | 4,128 | 26,821 | 6.5× |
| 43 | 6,782 | 38,320 | 5.7× |
Totals across all 43 turns: 152,222 input tokens (Rudi) vs 828,369 (full transcript), which is 5.4× fewer tokens, and the absolute gap widens as the session goes on because Rudi's curve is bounded while the transcript's is linear.
These numbers are from a run with fold disabled — graph slicing alone. See below for the measured fold result.
Cost of the entire 43-turn run on Claude Haiku 4.5: $0.34.
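The savings figures can be re-derived from the published token counts. The snippet below is only an arithmetic check on the numbers in the table above; the names `SAMPLES` and `savings` are made up for this illustration and are not part of Rudi.

```python
# Per-turn input token counts copied from the table above:
# turn -> (Rudi input, full-transcript input)
SAMPLES = {
    10: (1_467, 6_999),
    20: (3_581, 17_385),
    30: (4_128, 26_821),
    43: (6_782, 38_320),
}


def savings(rudi_tokens: int, transcript_tokens: int) -> float:
    """How many times fewer input tokens Rudi sent, rounded to one decimal."""
    return round(transcript_tokens / rudi_tokens, 1)


# Ratio and absolute difference at each sampled turn.
per_turn = {turn: savings(r, t) for turn, (r, t) in SAMPLES.items()}
absolute_gap = {turn: t - r for turn, (r, t) in SAMPLES.items()}

# Ratio of the 43-turn totals quoted in the text.
total = savings(152_222, 828_369)

print(per_turn)      # {10: 4.8, 20: 4.9, 30: 6.5, 43: 5.7}
print(absolute_gap)  # {10: 5532, 20: 13804, 30: 22693, 43: 31538}
print(total)         # 5.4
```

The ratio peaks at the turn-30 sample, while the absolute token gap grows at every sampled turn.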
Fold in action (second run)
At turn 29 of a separate run, fold fired for the first time:
    turn 28: input=5,075 tokens  active nodes=24
    [fold] d1–d8 (8 nodes, 20 hard rules) → stub d25
    [fold] d9–d16 (8 nodes, 20 hard rules) → stub d26
    [fold] d17–d21 (5 nodes, 16 hard rules) → stub d27
    turn 29: active nodes=6 (dropped 24 → 6)
    turn 30: input=2,865 tokens ← down 44% from turn 28
21 live nodes compressed into 3 stubs. 56 hard rules preserved verbatim. Input tokens nearly halved mid-session, automatically. That's the sawtooth: the graph gets smaller as the conversation gets longer.
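To make the bookkeeping in that log concrete, here is a minimal sketch of a fold step, assuming the caller already knows which node ids are dead. The `Node` class, the `fold` function, the stub text format, and the one-rule-per-node toy data are all hypothetical; the real logic lives in `fold.py` and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical, trimmed-down node: only the fields this sketch needs.
    id: str
    text: str
    hard_rules: list[str] = field(default_factory=list)
    is_stub: bool = False


def fold(active: dict[str, Node], dead_ids: list[str], stub_id: str) -> Node:
    """Replace the nodes in dead_ids with one stub that keeps their hard rules verbatim."""
    folded = [active.pop(node_id) for node_id in dead_ids]
    stub = Node(
        id=stub_id,
        text=f"[folded {dead_ids[0]}..{dead_ids[-1]}: {len(folded)} nodes]",
        # The prose is dropped; every hard rule is copied across unchanged.
        hard_rules=[rule for node in folded for rule in node.hard_rules],
        is_stub=True,
    )
    active[stub_id] = stub
    return stub


# Toy graph: 24 live nodes with one placeholder hard rule each (assumed data).
active = {
    f"d{i}": Node(f"d{i}", f"decision {i}", [f"rule from d{i}"])
    for i in range(1, 25)
}

# Same three ranges as in the log: d1-d8, d9-d16, d17-d21.
fold(active, [f"d{i}" for i in range(1, 9)], "d25")
fold(active, [f"d{i}" for i in range(9, 17)], "d26")
fold(active, [f"d{i}" for i in range(17, 22)], "d27")

print(len(active))  # 6
```

Folding 21 of the 24 nodes into 3 stubs leaves 3 untouched nodes plus 3 stubs, which matches the "24 → 6" drop in the log.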
It doesn't just stay small — it stays correct
Cheap context is worthless if the model forgets the rules. So the same benchmark plants 6 callback traps late in the session and checks whether decisions made dozens of turns earlier are still honored.
| Turn | Trap | Result |
| --- | --- | --- |
| 38 | Add logout: must use the exact auth mechanism chosen on turn 1 | |
| 39 | Profile endpoint: must scope via turn-1 auth and turn-2 DB | |
| 40 | Admin CSV export: a rule that was folded away banned cross-user data | ✅ surfaced |
| 41 | Email full notes: a folded rule banned note contents in email | ✅ surfaced |
| 42 | "Store the token in localStorage": conflicts with turn-1 hard rule | ✅ blocked |
| 43 | "Permanently delete a note": turn-11 chose soft-delete | ✅ flagged |
6 / 6. (First benchmark run: fold disabled, slicing only.) The two that matter most are traps #3 and #4 (turns 40 and 41): those rules had been compressed out of the active context by the time the trap was sprung, and the model still caught them, because hard rules are preserved verbatim on the fold stub. That's the whole thesis: forget the prose, keep the constraints.
How it works
Every model response is parsed into decision nodes, each linked backward to the decisions it depends on:
    node = {
        id, text,
        depends_on: [...],      # backward edges: what this decision rests on
        hard_rules: [...],      # binding constraints; the worker must halt if violated
        revises, exception_to,  # full replacement vs. narrow carve-out
        status, turn, pinned
    }
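As a rough illustration, the node layout above could be written as a Python dataclass. The field names come from the README; the class name `DecisionNode`, the types, the defaults, and the example values (including the `"active"` status string and the rule text) are assumptions, not the actual definitions in `rudi.py`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DecisionNode:
    id: str
    text: str
    depends_on: list[str] = field(default_factory=list)  # backward edges (ids of earlier nodes)
    hard_rules: list[str] = field(default_factory=list)  # binding constraints, kept verbatim
    revises: Optional[str] = None       # id of a node this one fully replaces (assumed type)
    exception_to: Optional[str] = None  # id of a node this one narrowly carves out of (assumed type)
    status: str = "active"              # assumed default; real status values are not shown
    turn: int = 0
    pinned: bool = False


# Illustrative nodes loosely modelled on the benchmark's turn-1 auth decision.
auth = DecisionNode(
    id="d1",
    text="Choose the auth mechanism",
    hard_rules=["Never store the token in localStorage"],  # illustrative wording
    turn=1,
)
logout = DecisionNode(id="d30", text="Add logout", depends_on=["d1"], turn=38)

print(logout.depends_on)  # ['d1']
```

Using `field(default_factory=list)` gives every node its own list, so appending a rule to one node cannot leak into another.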
Slice, don't dump. Before each turn, Rudi injects only the nodes reachable from the current task — not the transcript.
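One plausible way to compute such a slice is a breadth-first walk along the `depends_on` edges, starting from the nodes the current task touches. This is a sketch under that assumption: `slice_graph`, the `seeds` argument, and the toy graph are hypothetical, and how Rudi actually picks its starting nodes is not shown here.

```python
from collections import deque


def slice_graph(depends_on: dict[str, list[str]], seeds: list[str]) -> set[str]:
    """Return the seed nodes plus everything they transitively depend on."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        node_id = queue.popleft()
        # Follow backward edges: what does this decision rest on?
        for parent in depends_on.get(node_id, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Toy graph: d3 rests on d2, which rests on d1; d4 and d5 form an unrelated branch.
graph = {
    "d1": [],
    "d2": ["d1"],
    "d3": ["d2"],
    "d4": [],
    "d5": ["d4"],
}

relevant = slice_graph(graph, ["d3"])
print(sorted(relevant))  # ['d1', 'd2', 'd3']
```

Only the three nodes on the d3 branch would be injected; d4 and d5 stay out of the prompt.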
Fold. When a branch of decisions goes reachability-dead, a background pass compresses it into a one-line stub. Hard rules survive the fold verbatim,...