Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%

zozo123-IB1 pts0 comments

Inferoa — watch an agent loop spend tokens

INFEROA / FLIGHT RECORDER<br>announcement · built on the vLLM stack

Your agent harness is<br>burning tokens.<br>Watch it happen.

Inferoa is an inference-native agent harness for long-horizon coding work.<br>It treats the mechanics and cost of inference — prefix caches, context shape, model routing —<br>as the design constraint, not an afterthought. Below: a turn-by-turn simulation of the same<br>coding task run through a naive harness vs. Inferoa.

▶ RUN AGENT TASK<br>SPEED 1×<br>RESET<br>task: “fix the failing auth test” · 8 turns

What this is: the page below is a client-side simulation, parameterized from the<br>results reported in the Inferoa announcement (90.0% cached-token discount, 80.8% context<br>reduction via CodeGraph, 61.4% tool-output reduction via RTK). But the mechanics are real —<br>we also ran the real thing and kept the receipts ↓

Receipts — what's real.

A simulation is legible; receipts are credible. We ran the real stack in isolated<br>islo.dev sandboxes<br>and kept the artifacts. Everything below actually executed.

REAL PREFIX CACHE · vLLM /metrics

97.8%

measured cache-hit rate

Real Inferoa v0.11.0 (npm) drove a real<br>vLLM v0.22.1 server (Qwen2.5-0.5B, CPU). vLLM's own Prometheus<br>counters after the run: prefix_cache_hits_total 1,611,008 of<br>queries_total 1,647,574. The harness's byte-stable prompt<br>prefix beat the announcement's 90% claim.

REAL AGENT WORK · MERGED TO MAIN

2 lines

failing test → fix → green, in a sandbox

An agent in an isolated sandbox was handed a repo with a genuinely failing test<br>(tz-naive vs tz-aware datetime) and a task prompt that named the suspected cause. It produced<br>the 2-line fix, re-ran pytest to green, and the change is<br>merged on main. Verbatim before/after output is embedded in<br>step 7 below.<br>What's proven: the sandboxed execute-verify-publish workflow — not unguided agent diagnosis.

REAL WIRING · TWO SANDBOXES

vLLM in one, Inferoa in the other

Sandbox A serves vLLM on :8000; islo share exposed it at a<br>public *.share.islo.dev URL; sandbox B's Inferoa pointed its<br>base_url there. Inferoa's event log records the proof:<br>provider_id: vllm:openai_compatible:https://…share.islo.dev/v1,<br>prompt_tokens: 16,829 per turn, with stable prompt/tool-schema hashes — cache discipline, visible.

HONEST LIMITS

what we don't claim

The 0.5B CPU model proves the mechanics (caching, routing, harness loop), not frontier-grade<br>coding — it made a real but adorably confused tool call. Sim pricing is illustrative and published in the<br>README.<br>Share URLs expire in 24h; the repos and PR are permanent.

TOKENS PROCESSED

PREFIX CACHE HIT RATE

$0.0000<br>SPEND AVOIDED VS NAIVE

0 / 8<br>AGENT TURNS

AGENT STREAMIDLE

// press RUN to start the agent loop…

CONTEXT WINDOW

0 tokens128k window

cached prefix (≈10% price)<br>fresh input<br>output

SEMANTIC ROUTER

// routing decisions appear here

CUMULATIVE SPEND — SAME TASK, TWO HARNESSES

naive harness — full resend, raw tool dumps, frontier-only$0.0000

inferoa — prefix cache + codegraph + rtk + routing$0.0000

The real run, end to end.

Every block below is captured output from the actual run on 2026-06-10 — three isolated<br>islo.dev sandboxes,<br>real vLLM, real Inferoa, real metrics. Nothing here is mocked. Two pipelines, two results:

⌨ laptop — the islo CLI creates three isolated sandboxes on islo.dev · no GPU, no devops, one command each

Pipeline 1 · measure the inference claim

SANDBOX B<br>inferoa@0.11.0 the actual harness, installed from npm

real inference requests · HTTPS

PUBLIC URL · islo share<br>https://…share.islo.dev sandbox port 8000, exposed in one command

SANDBOX A<br>vLLM v0.22.1 · Qwen2.5-0.5B · CPU prefix_caching = ON

read off vLLM's own /metrics

RESULT<br>97.8% cache-hit rate 1,611,008 / 1,647,574 prompt tokens — claim was 90%

Pipeline 2 · prove the agent workflow

SANDBOX C<br>coding agent + failing repo TypeError: naive vs aware datetime comparison

pytest red → 2-line fix → pytest green

GITHUB · MAIN<br>fix merged (ffda3d7) verbatim before/after output in step 7 below

RESULT<br>0 → 2 tests passing execute → verify → publish, fully isolated

What happened: the harness in B drove the model in A through a public islo.dev URL, and vLLM's own<br>counters measured the cache. The agent in C turned a failing repo green. Every box maps to a numbered step below with its captured output.

97.8%<br>REAL CACHE-HIT RATE (vLLM /metrics)

1.65M<br>PROMPT TOKENS THROUGH vLLM

138<br>SUCCESSFUL MODEL REQUESTS

0 → 2<br>TESTS PASSING AFTER AGENT FIX

Spin up a vLLM sandbox

One CLI command turns the official vLLM CPU image into a running sandbox. Note the engine config: enable_prefix_caching=True .

$ islo use vllm-cpu-1781079422 -i docker.io/vllm/vllm-openai-cpu:latest-x86_64 \<br>--cpu 6 --memory 12288 --disk 30<br>✓ Sandbox 'vllm-cpu-1781079422' created<br>$ vllm serve Qwen/Qwen2.5-0.5B-Instruct --dtype bfloat16 --max-model-len 32768<br>(EngineCore) INFO core.py:112 Initializing a V1 LLM engine (v0.22.1) with config:<br>model='Qwen/Qwen2.5-0.5B-Instruct',...

vllm real agent inferoa islo sandbox

Related Articles