Causal graph memory for LLMs. Flat token cost, no matter how long the session runs

Repository: raphaelwkago-sketch/rudi (GitHub)

Rudi

Causal graph memory for LLMs. Flat token cost, no matter how long the session runs.

Every LLM API call re-sends the whole conversation. Cost grows every turn; eventually you hit the context limit. Rudi replaces the growing transcript with a dependency graph of decisions — and injects only the slice relevant to the current task. Turn 10,000 costs about the same as turn 10.

The 30-second version

In a 43-turn software-architecture session (building a Notes API turn by turn), the standard "re-send the full transcript" approach was sending ~38,000 input tokens by the final turn. Rudi sent 6,782 — for the same task, same model, same answer quality.

| Turn | Rudi input | Full-transcript input | Savings |
|---|---|---|---|
| | 382 | 340 | |
| 10 | 1,467 | 6,999 | 4.8× |
| 20 | 3,581 | 17,385 | 4.9× |
| 30 | 4,128 | 26,821 | 6.5× |
| 43 | 6,782 | 38,320 | 5.7× |

Totals across all 43 turns: 152,222 input tokens (Rudi) vs 828,369 (full transcript), or 5.4× fewer tokens. The gap widens as the session goes on, because Rudi's curve is bounded while the transcript's is linear.
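The savings column and the 5.4× total are simply the full-transcript token count divided by Rudi's. A quick check in Python, using only the figures quoted above (the variable and function names are illustrative and do not come from the repository):

```python
# Input-token counts copied from the table above: turn -> (rudi, full_transcript).
rows = {
    10: (1_467, 6_999),
    20: (3_581, 17_385),
    30: (4_128, 26_821),
    43: (6_782, 38_320),
}


def savings(rudi_tokens: int, transcript_tokens: int) -> float:
    """Full-transcript input tokens divided by Rudi input tokens, to one decimal."""
    return round(transcript_tokens / rudi_tokens, 1)


per_turn = {turn: savings(rudi, full) for turn, (rudi, full) in rows.items()}
total = savings(152_222, 828_369)  # totals across all 43 turns

for turn, ratio in per_turn.items():
    print(f"turn {turn}: {ratio}x")
print(f"all 43 turns: {total}x")
```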

These numbers are from a run with fold disabled — graph slicing alone. See below for the measured fold result.

Cost of the entire 43-turn run on Claude Haiku 4.5: $0.34.

Fold in action (second run)

At turn 29 of a separate run, fold fired for the first time:

    turn 28: input=5,075 tokens  active nodes=24
      [fold] d1–d8 (8 nodes, 20 hard rules) → stub d25
      [fold] d9–d16 (8 nodes, 20 hard rules) → stub d26
      [fold] d17–d21 (5 nodes, 16 hard rules) → stub d27
    turn 29: active nodes=6 (dropped 24 → 6)
    turn 30: input=2,865 tokens ← down 44% from turn 28

21 live nodes compressed into 3 stubs. 56 hard rules preserved verbatim. Input tokens nearly halved mid-session, automatically. That's the sawtooth: the graph gets smaller as the conversation gets longer.
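The bookkeeping in that log can be reproduced by hand: each fold removes its nodes from the active set and adds one stub in their place. A small sketch using only the numbers from the excerpt (the dictionary layout is an assumption made for illustration, not Rudi's internal format):

```python
# Figures copied from the log excerpt above.
nodes_replaced = {"d25": 8, "d26": 8, "d27": 5}       # stub id -> nodes it replaced
hard_rules_kept = {"d25": 20, "d26": 20, "d27": 16}   # stub id -> hard rules carried over

active_before = 24
nodes_folded = sum(nodes_replaced.values())
# Folded nodes leave the active set; one stub per fold joins it.
active_after = active_before - nodes_folded + len(nodes_replaced)
rules_preserved = sum(hard_rules_kept.values())

# Relative drop in input tokens from turn 28 to turn 30, as a whole percent.
drop_pct = round(100 * (5_075 - 2_865) / 5_075)

print(nodes_folded, active_after, rules_preserved, drop_pct)
```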

It doesn't just stay small — it stays correct

Cheap context is worthless if the model forgets the rules. So the same benchmark plants 6 callback traps late in the session and checks whether decisions made dozens of turns earlier are still honored.

| Turn | Trap | Result |
|---|---|---|
| 38 | Add logout: must use the exact auth mechanism chosen on turn 1 | |
| 39 | Profile endpoint: must scope via turn-1 auth and turn-2 DB | |
| 40 | Admin CSV export: a rule that was folded away banned cross-user data | ✅ surfaced |
| 41 | Email full notes: a folded rule banned note contents in email | ✅ surfaced |
| 42 | "Store the token in localStorage": conflicts with turn-1 hard rule | ✅ blocked |
| 43 | "Permanently delete a note": turn-11 chose soft-delete | ✅ flagged |

6 / 6. (First benchmark run: fold disabled, slicing only.) The two that matter most are traps #3 and #4 (turns 40 and 41): those rules had been compressed out of the active context by the time the trap was sprung, and the model still caught them, because hard rules are preserved verbatim on the fold stub. That's the whole thesis: forget the prose, keep the constraints.

How it works

Every model response is parsed into decision nodes, each linked backward to the decisions it depends on:

    node = {
        id, text,
        depends_on: [...],       # backward edges: what this decision rests on
        hard_rules: [...],       # binding constraints; the worker must halt if violated
        revises, exception_to,   # full replacement vs. narrow carve-out
        status, turn, pinned
    }
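For readers who prefer a concrete type, the node shape above could be written as a Python dataclass. This is a sketch only: the field names come from the listing, but the field types, the default values, the `"active"` status string, and the sample decision text are all assumptions, and the actual definitions in `rudi.py` may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DecisionNode:
    id: str
    text: str
    depends_on: list[str] = field(default_factory=list)  # backward edges
    hard_rules: list[str] = field(default_factory=list)  # binding constraints
    revises: Optional[str] = None       # id of a decision this one fully replaces
    exception_to: Optional[str] = None  # id of a decision this one carves out of
    status: str = "active"              # assumed value; the source names no statuses
    turn: int = 0
    pinned: bool = False


# Hypothetical decisions, invented for illustration only.
d1 = DecisionNode(
    id="d1",
    text="Choose the auth mechanism",
    hard_rules=["Do not store the token in localStorage"],
    turn=1,
)
d2 = DecisionNode(id="d2", text="Choose the database", depends_on=["d1"], turn=2)

print(d2.depends_on, d1.hard_rules)
```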

Slice, don't dump. Before each turn, Rudi injects only the nodes reachable from the current task — not the transcript.
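One plausible reading of "reachable from the current task" is a backward traversal over `depends_on` edges. The sketch below assumes that reading; the function name `slice_for_task`, the plain-dict graph, and the way the starting nodes are chosen are all hypothetical, since the source does not say how Rudi maps a task onto graph nodes.

```python
from collections import deque


def slice_for_task(graph: dict[str, list[str]], task_nodes: list[str]) -> list[str]:
    """Return the ids reachable from task_nodes by following depends_on edges.

    graph maps each node id to the ids it depends on (its backward edges).
    """
    seen: set[str] = set()
    queue = deque(task_nodes)
    while queue:
        node_id = queue.popleft()
        if node_id in seen or node_id not in graph:
            continue
        seen.add(node_id)
        queue.extend(graph[node_id])
    # Preserve the graph's own ordering so earlier decisions come first.
    return [node_id for node_id in graph if node_id in seen]


# Hypothetical graph: d3 rests on d2, which rests on d1; d4 and d5 are unrelated.
graph = {"d1": [], "d2": ["d1"], "d3": ["d2"], "d4": [], "d5": ["d4"]}
context = slice_for_task(graph, ["d3"])
print(context)
```

Only the three nodes on the dependency chain end up in `context`; `d4` and `d5` are left out of the prompt entirely, which is why the injected size tracks the task's ancestry and not the length of the session.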

Fold. When a branch of decisions goes reachability-dead, a background pass compresses it into a one-line stub. Hard rules survive the fold verbatim, ...
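A minimal sketch of what such a fold might look like, assuming nodes are plain dicts with `id`, `text`, and `hard_rules` keys. The `fold` function, the stub's text format, and the `"stub"` status are hypothetical, the sample rules paraphrase the benchmark traps, and the detection of reachability-dead branches is not shown. The actual logic in `fold.py` may differ.

```python
def fold(nodes: list[dict], stub_id: str) -> dict:
    """Compress a run of decision nodes into one stub.

    The prose of each node is dropped; every hard rule is copied over unchanged.
    """
    first, last = nodes[0]["id"], nodes[-1]["id"]
    rules = [rule for node in nodes for rule in node["hard_rules"]]
    return {
        "id": stub_id,
        "text": f"[folded {first}..{last}: {len(nodes)} decisions]",
        "hard_rules": rules,
        "status": "stub",
    }


# Hypothetical dead branch, invented for illustration.
branch = [
    {"id": "d1", "text": "Long rationale about exports",
     "hard_rules": ["No cross-user data in exports"]},
    {"id": "d2", "text": "Long rationale about notifications",
     "hard_rules": ["No note contents in email"]},
]
stub = fold(branch, "d25")
print(stub["text"])
print(stub["hard_rules"])
```

The stub's text is a single line regardless of how much prose the branch held, while `hard_rules` still contains every constraint word for word, so a later request that violates one can still be checked against it.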
