Inferoa — watch an agent loop spend tokens
INFEROA / FLIGHT RECORDER<br>announcement · built on the vLLM stack
Your agent harness is<br>burning tokens.<br>Watch it happen.
Inferoa is an inference-native agent harness for long-horizon coding work.<br>It treats the mechanics and cost of inference — prefix caches, context shape, model routing —<br>as the design constraint, not an afterthought. Below: a turn-by-turn simulation of the same<br>coding task run through a naive harness vs. Inferoa.
▶ RUN AGENT TASK<br>SPEED 1×<br>RESET<br>task: “fix the failing auth test” · 8 turns
What this is: the page below is a client-side simulation, parameterized from the<br>results reported in the Inferoa announcement (90.0% cached-token discount, 80.8% context<br>reduction via CodeGraph, 61.4% tool-output reduction via RTK). The mechanics are real, though:<br>we also ran the real stack and kept the receipts ↓
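As a rough sketch of how a simulation like this might apply the reported percentages to per-turn token counts: the two reduction figures come from the text above, while the function name and the naive token counts are hypothetical placeholders, not measurements from any run.

```python
# Reduction figures as quoted from the Inferoa announcement above.
CONTEXT_REDUCTION = 0.808      # context reduction via CodeGraph
TOOL_OUTPUT_REDUCTION = 0.614  # tool-output reduction via RTK


def reduced(tokens: int, reduction: float) -> int:
    """Return the token count left after removing `reduction` (0..1) of it."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return round(tokens * (1.0 - reduction))


# Hypothetical per-turn sizes for a naive harness (illustrative only).
naive_context_tokens = 10_000
naive_tool_output_tokens = 5_000

print(reduced(naive_context_tokens, CONTEXT_REDUCTION))          # 1920
print(reduced(naive_tool_output_tokens, TOOL_OUTPUT_REDUCTION))  # 1930
```

The percentages are applied independently here; a real simulation may combine them differently.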
Receipts — what's real.
A simulation is legible; receipts are credible. We ran the real stack in isolated<br>islo.dev sandboxes<br>and kept the artifacts. Everything below actually executed.
REAL PREFIX CACHE · vLLM /metrics
97.8%
measured cache-hit rate
Real Inferoa v0.11.0 (npm) drove a real<br>vLLM v0.22.1 server (Qwen2.5-0.5B, CPU). vLLM's own Prometheus<br>counters after the run: prefix_cache_hits_total 1,611,008 of<br>queries_total 1,647,574, or 97.8%. With the harness's byte-stable prompt<br>prefix, the measured hit rate came in above the announcement's 90% figure.
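For readers who want to reproduce the arithmetic, here is a minimal sketch that reads two counters out of Prometheus text-format output and divides them. The counter values are the ones quoted above; the full metric names, the `model_name` label, and the helper names are assumptions and may differ from what a given vLLM version emits.

```python
def read_counter(metrics_text: str, metric_name: str) -> float:
    """Sum every sample of `metric_name` in Prometheus text-format output."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # A sample looks like: name{label="x"} 123.0
        # (this sketch assumes no trailing timestamp field)
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] == metric_name:
            total += float(value)
    return total


def hit_rate(hits: float, queries: float) -> float:
    """Fraction of queried prompt tokens that were served from the prefix cache."""
    return hits / queries if queries else 0.0


# Sample text built from the two values quoted above.
# Metric names and the label are assumed, not copied from the run.
sample = """\
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="Qwen/Qwen2.5-0.5B-Instruct"} 1647574.0
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="Qwen/Qwen2.5-0.5B-Instruct"} 1611008.0
"""

hits = read_counter(sample, "vllm:prefix_cache_hits_total")
queries = read_counter(sample, "vllm:prefix_cache_queries_total")
print(f"{hit_rate(hits, queries):.1%}")  # prints 97.8%
```

Against a live server, the same text could be fetched from the `/metrics` endpoint and passed to `read_counter` unchanged.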
REAL AGENT WORK · MERGED TO MAIN
2 lines
failing test → fix → green, in a sandbox
An agent in an isolated sandbox was handed a repo with a genuinely failing test<br>(tz-naive vs tz-aware datetime) and a task prompt that named the suspected cause. It produced<br>the 2-line fix, re-ran pytest to green, and the change is<br>merged on main. Verbatim before/after output is embedded in<br>step 7 below.<br>What's proven: the sandboxed execute-verify-publish workflow — not unguided agent diagnosis.
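To illustrate the class of bug involved, here is a minimal sketch of a tz-naive vs tz-aware comparison and one common way to fix it. This is not the repo's code or the merged fix; the function names and the expiry scenario are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_expired_buggy(expires_at: datetime) -> bool:
    # datetime.now() with no tz argument is naive; ordering it against
    # an aware value raises TypeError.
    return datetime.now() > expires_at


def is_expired_fixed(expires_at: datetime) -> bool:
    # Build "now" as an aware UTC datetime so both sides carry tzinfo.
    now = datetime.now(timezone.utc)
    return now > expires_at


# An aware timestamp one hour in the future (hypothetical token expiry).
token_expiry = datetime.now(timezone.utc) + timedelta(hours=1)

try:
    is_expired_buggy(token_expiry)
except TypeError as exc:
    print("before:", exc)

print("after:", is_expired_fixed(token_expiry))
```

The fix keeps both operands timezone-aware rather than stripping tzinfo from the stored value, which avoids silently comparing times from different zones.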
REAL WIRING · TWO SANDBOXES
vLLM in one, Inferoa in the other
Sandbox A serves vLLM on :8000; islo share exposed it at a<br>public *.share.islo.dev URL; sandbox B's Inferoa pointed its<br>base_url there. Inferoa's event log records the proof:<br>provider_id: vllm:openai_compatible:https://…share.islo.dev/v1,<br>prompt_tokens: 16,829 per turn, with stable prompt/tool-schema hashes — cache discipline, visible.
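One way to think about "stable prompt/tool-schema hashes" is as a digest over a canonical serialization of the prompt prefix: if the digest is unchanged between turns, the prefix bytes are unchanged and the cache can reuse them. The sketch below shows that idea only; it is not Inferoa's actual hashing scheme, and the tool schema and prompts are hypothetical.

```python
import hashlib
import json


def stable_hash(obj) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical tool schema, written with two different key orders.
tools_a = [{"name": "read_file", "parameters": {"path": "string"}}]
tools_b = [{"parameters": {"path": "string"}, "name": "read_file"}]

# Hypothetical system prompts: one fixed, one with a per-turn timestamp.
system_stable = "You are a coding agent."
system_volatile = "You are a coding agent. Current time: 12:00:01"

# Key order does not change the digest, so the prefix stays byte-stable.
print(stable_hash(tools_a) == stable_hash(tools_b))              # True
# Volatile content in the prefix changes the digest on every turn.
print(stable_hash(system_stable) == stable_hash(system_volatile))  # False
```

Logging such a digest per turn makes prefix drift visible before it shows up as a lower cache-hit rate.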
HONEST LIMITS
what we don't claim
The 0.5B CPU model proves the mechanics (caching, routing, harness loop), not frontier-grade<br>coding — it made a real but adorably confused tool call. Sim pricing is illustrative and published in the<br>README.<br>Share URLs expire in 24h; the repos and PR are permanent.
TOKENS PROCESSED
PREFIX CACHE HIT RATE
$0.0000<br>SPEND AVOIDED VS NAIVE
0 / 8<br>AGENT TURNS
AGENT STREAM<br>IDLE
// press RUN to start the agent loop…
CONTEXT WINDOW
0 tokens<br>128k window
cached prefix (≈10% price)<br>fresh input<br>output
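The legend's three token classes map onto a simple per-turn cost formula. The sketch below assumes cached input is billed at about 10% of the fresh input price, as the legend states; the per-million-token prices and the cached/fresh split are hypothetical, and only the 16,829-token prompt size is taken from this page.

```python
def turn_cost(
    cached_tokens: int,
    fresh_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    cached_discount: float = 0.90,
) -> float:
    """Dollar cost of one turn; cached input is billed at (1 - discount) of input price."""
    cached_price = input_price_per_mtok * (1.0 - cached_discount)
    total = (
        cached_tokens * cached_price
        + fresh_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    )
    return total / 1_000_000


# Hypothetical prices in dollars per million tokens (illustrative only).
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

# Same 16,829-token prompt, billed two ways; the 16,000 / 829 split is assumed.
naive = turn_cost(0, 16_829, 200, INPUT_PRICE, OUTPUT_PRICE)
cached = turn_cost(16_000, 829, 200, INPUT_PRICE, OUTPUT_PRICE)

print(f"naive:  ${naive:.4f}")
print(f"cached: ${cached:.4f}")
```

Output tokens cost the same in both cases, so the saving comes entirely from how much of the prompt lands in the cached prefix.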
SEMANTIC ROUTER
// routing decisions appear here
CUMULATIVE SPEND — SAME TASK, TWO HARNESSES
naive harness: full resend, raw tool dumps, frontier-only<br>$0.0000
inferoa: prefix cache + codegraph + rtk + routing<br>$0.0000
The real run, end to end.
Every block below is captured output from the actual run on 2026-06-10 — three isolated<br>islo.dev sandboxes,<br>real vLLM, real Inferoa, real metrics. Nothing here is mocked. Two pipelines, two results:
⌨ laptop — the islo CLI creates three isolated sandboxes on islo.dev · no GPU, no devops, one command each
Pipeline 1 · measure the inference claim
SANDBOX B<br>inferoa@0.11.0<br>the actual harness, installed from npm
real inference requests · HTTPS
PUBLIC URL · islo share<br>https://…share.islo.dev<br>sandbox port 8000, exposed in one command
SANDBOX A<br>vLLM v0.22.1 · Qwen2.5-0.5B · CPU<br>prefix_caching = ON
read off vLLM's own /metrics
RESULT<br>97.8% cache-hit rate<br>1,611,008 / 1,647,574 prompt tokens (claim was 90%)
Pipeline 2 · prove the agent workflow
SANDBOX C<br>coding agent + failing repo<br>TypeError: naive vs aware datetime comparison
pytest red → 2-line fix → pytest green
GITHUB · MAIN<br>fix merged (ffda3d7)<br>verbatim before/after output in step 7 below
RESULT<br>0 → 2 tests passing<br>execute → verify → publish, fully isolated
What happened: the harness in B drove the model in A through a public islo.dev URL, and vLLM's own<br>counters measured the cache. The agent in C turned a failing repo green. Every box maps to a numbered step below with its captured output.
97.8%<br>REAL CACHE-HIT RATE (vLLM /metrics)
1.65M<br>PROMPT TOKENS THROUGH vLLM
138<br>SUCCESSFUL MODEL REQUESTS
0 → 2<br>TESTS PASSING AFTER AGENT FIX
Spin up a vLLM sandbox
One CLI command turns the official vLLM CPU image into a running sandbox. Note the engine config: enable_prefix_caching=True.
$ islo use vllm-cpu-1781079422 -i docker.io/vllm/vllm-openai-cpu:latest-x86_64 \<br>--cpu 6 --memory 12288 --disk 30<br>✓ Sandbox 'vllm-cpu-1781079422' created<br>$ vllm serve Qwen/Qwen2.5-0.5B-Instruct --dtype bfloat16 --max-model-len 32768<br>(EngineCore) INFO core.py:112 Initializing a V1 LLM engine (v0.22.1) with config:<br>model='Qwen/Qwen2.5-0.5B-Instruct',...