Make Your OpenClaw Agent Cheaper, and Measure It Yourself
Guy Freeman

Your OpenClaw agent makes a small decision on every tool call, and right now it makes most of them badly. It sends each call to whichever model you hard-coded, whether the task needed the dear one or the cheap one. It re-runs calls it already ran a turn ago, and pays for them again. And when a prompt injection rides in on a document it was only meant to read, nothing stands between that and a real action. Most agents handle the first by static configuration, the second never, the third never.

credence-pi handles all three, automatically, from one belief. It is an OpenClaw plugin plus a small local daemon that holds one Bayesian belief about your agent, learned from your own approvals and refusals and updated as you work. It plugs into two points in the loop: when OpenClaw is choosing which model to call, and when your agent is about to make a tool call. At both, it maximises expected utility and does three things you are currently paying for by hand:

Routes to the cheapest model that will do the job. It tries the cheap model first and escalates only when the expected payoff covers the next model’s cost, stopping at the first call that actually works. It ends up solving more tasks than any single model while spending like the cheap one whenever the cheap one is enough. This is on by default, and it is where the money is.

Blocks the calls your agent wastes. Same tool, same arguments, same session: gone, before you pay for them a second time.

Asks before an injected action fires. An exfiltration that arrived inside untrusted data surfaces to you as a confirmation instead of simply happening.

Three levers, one posterior, nothing to tune: the first chooses the model, the other two govern the tool call. No thresholds, no rules table, no magic numbers.

You do not have to trust any of this on faith, and you should not. Run credence-pi in shadow mode and it changes nothing about your runs.
It watches, and it reports what it would have done on your own traffic: what it would have routed, what it would have blocked, the dollars that implies, and the part most governors will not show you, its own false-block rate. The first thing you get is a free audit of your own sessions. You switch on enforcement only once the numbers have convinced you.

Try it now

You need OpenClaw and Docker. Then it is three steps.

Start the brain. A local daemon on 127.0.0.1:8787, restart-resilient:

docker run -d --name credence-pi --restart unless-stopped \
  -p 127.0.0.1:8787:8787 -v ~/.credence-pi:/root/.credence-pi \
  ghcr.io/gfrmin/credence-pi-daemon

Install the body, then restart OpenClaw. Governance and routing are both on by default:

openclaw plugins install @gfrmin/credence-pi-openclaw
openclaw plugins enable credence-pi
# restart the OpenClaw gateway so it loads the plugin, then confirm:
openclaw plugins list # credence-pi should read "loaded"

That is the whole install, and both artifacts are published and public, so it works as written today.

Audit before you enforce. This step is optional, but it is the one I would actually do first. Set shadowMode: true in the plugin config so credence-pi observes without changing anything, use your agent normally for a while, then read back what it would have done:

curl http://127.0.0.1:8787/report

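If you want a feel for what you are auditing, the two simplest levers described earlier (cheap-first escalation and exact-repeat blocking) can be sketched in a few lines. This is a toy illustration and not credence-pi's code. The names `Model`, `route` and `RepeatGuard` are hypothetical, costs are in made-up cents, and the success probabilities are fixed numbers standing in for the learned Bayesian belief, which this sketch does not attempt to reproduce.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Model:
    name: str
    cost: float       # hypothetical price of one attempt, in cents
    p_success: float  # hypothetical believed chance that the attempt works


def route(
    models: Iterable[Model],
    task_value: float,
    attempt: Callable[[Model], bool],
) -> Tuple[Optional[str], float]:
    """Try models cheapest-first and stop at the first attempt that works.

    A model is tried only if its expected payoff (p_success * task_value)
    covers its cost. Returns (name of the model that worked, or None, and
    the total spent across all attempts).
    """
    spent = 0.0
    for model in sorted(models, key=lambda m: m.cost):
        if model.p_success * task_value < model.cost:
            continue  # expected payoff does not cover this model's cost
        spent += model.cost
        if attempt(model):
            return model.name, spent
    return None, spent


class RepeatGuard:
    """Flags a tool call already seen with the same session, tool and arguments."""

    def __init__(self) -> None:
        self._seen: set = set()

    def is_repeat(self, session_id: str, tool: str, args: dict) -> bool:
        # Canonical JSON, so the key order of the arguments does not matter.
        key = (session_id, tool, json.dumps(args, sort_keys=True))
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


if __name__ == "__main__":
    models = [
        Model("cheap", 1.0, 0.6),
        Model("mid", 5.0, 0.8),
        Model("flagship", 30.0, 0.9),
    ]
    # Pretend only the mid-tier model can solve this task.
    print(route(models, task_value=25.0, attempt=lambda m: m.name == "mid"))
    # prints ('mid', 6.0): one failed cheap attempt plus one mid attempt
```

In this toy setup the flagship is never tried on a task worth 25 cents, because 0.9 × 25 is less than its 30-cent cost. Raising the task value changes that, which is the sense in which a single rule can serve cost-sensitive and quality-first users differently.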
Everything runs locally: the daemon keeps an append-only log of every observation and decision on your machine, and no raw data leaves it. Routing is fail-open, so if the daemon is slow or down OpenClaw simply uses its configured model and your agent keeps working. The full install notes, a from-source path for the daemon if you would rather not run Docker, and every config key are in the plugin README.

What it actually does, measured

On real OpenClaw sessions and a live benchmark run, not on demos built to be caught:

Routing. Across seventeen real Terminal-Bench tasks scored live through the daemon, trying-cheap-then-escalating beat every fixed single-model choice for every kind of user: the cost-sensitive one, the balanced one, and the quality-obsessed one. That is the whole point. No single model is the right default for everyone, so picking one and sticking with it, which is what almost everyone does, is wrong for someone. The escalation policy captures the union of the models’ strengths, and one of those strengths is not where you would guess: on this benchmark the mid-tier model beats the flagship at reasoning, so on a quality-first profile the router sometimes routes reasoning to the cheaper model. No fixed rule expresses that.

Waste. Exact-repeat tool calls blocked at precision 1.0 and recall 1.0 on held-out sessions, about 0.7% of all calls.

Injection. Taint-flow features reach 0.82 to 0.97 precision on a public benchmark, against a regex baseline’s 0.67 that barely clears the 0.59 base rate. Run through the brain, an injected exfiltration surfaces to you as a confirmation at 0.94 precision while...