Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%

Inferoa — watch an agent loop spend tokens

INFEROA / FLIGHT RECORDER announcement · built on the vLLM stack

Your agent harness is burning tokens. Watch it happen.

Inferoa is an inference-native agent harness for long-horizon coding work. It treats the mechanics and cost of inference — prefix caches, context shape, model routing — as the design constraint, not an afterthought. Below: a turn-by-turn simulation of the same coding task run through a naive harness vs. Inferoa.

▶ RUN AGENT TASK SPEED 1× RESET task: “fix the failing auth test” · 8 turns

What this is: the page below is a client-side simulation, parameterized from the results reported in the Inferoa announcement (90.0% cached-token discount, 80.8% context reduction via CodeGraph, 61.4% tool-output reduction via RTK). But the mechanics are real — we also ran the real thing and kept the receipts ↓

Receipts — what's real.

A simulation is legible; receipts are credible. We ran the real stack in isolated islo.dev sandboxes and kept the artifacts. Everything below actually executed.

REAL PREFIX CACHE · vLLM /metrics

97.8%

measured cache-hit rate

Real Inferoa v0.11.0 (npm) drove a real vLLM v0.22.1 server (Qwen2.5-0.5B, CPU). vLLM's own Prometheus counters after the run: prefix_cache_hits_total 1,611,008 of queries_total 1,647,574. The harness's byte-stable prompt prefix beat the announcement's 90% claim.

REAL AGENT WORK · MERGED TO MAIN

2 lines

failing test → fix → green, in a sandbox

An agent in an isolated sandbox was handed a repo with a genuinely failing test (tz-naive vs tz-aware datetime) and a task prompt that named the suspected cause. It produced the 2-line fix, re-ran pytest to green, and the change is merged on main. Verbatim before/after output is embedded in step 7 below. What's proven: the sandboxed execute-verify-publish workflow — not unguided agent diagnosis.

REAL WIRING · TWO SANDBOXES

vLLM in one, Inferoa in the other

Sandbox A serves vLLM on :8000; islo share exposed it at a public *.share.islo.dev URL; sandbox B's Inferoa pointed its base_url there. Inferoa's event log records the proof: provider_id: vllm:openai_compatible:https://…share.islo.dev/v1, prompt_tokens: 16,829 per turn, with stable prompt/tool-schema hashes — cache discipline, visible.

HONEST LIMITS

what we don't claim

The 0.5B CPU model proves the mechanics (caching, routing, harness loop), not frontier-grade coding — it made a real but adorably confused tool call. Sim pricing is illustrative and published in the README. Share URLs expire in 24h; the repos and PR are permanent.

TOKENS PROCESSED

PREFIX CACHE HIT RATE

$0.0000 SPEND AVOIDED VS NAIVE

0 / 8 AGENT TURNS

AGENT STREAMIDLE

// press RUN to start the agent loop…

CONTEXT WINDOW

0 tokens128k window

cached prefix (≈10% price) fresh input output

SEMANTIC ROUTER

// routing decisions appear here

CUMULATIVE SPEND — SAME TASK, TWO HARNESSES

naive harness — full resend, raw tool dumps, frontier-only$0.0000

inferoa — prefix cache + codegraph + rtk + routing$0.0000

The real run, end to end.

Every block below is captured output from the actual run on 2026-06-10 — three isolated islo.dev sandboxes, real vLLM, real Inferoa, real metrics. Nothing here is mocked. Two pipelines, two results:

⌨ laptop — the islo CLI creates three isolated sandboxes on islo.dev · no GPU, no devops, one command each

Pipeline 1 · measure the inference claim

SANDBOX B inferoa@0.11.0 the actual harness, installed from npm

real inference requests · HTTPS

PUBLIC URL · islo share https://…share.islo.dev sandbox port 8000, exposed in one command

SANDBOX A vLLM v0.22.1 · Qwen2.5-0.5B · CPU prefix_caching = ON

read off vLLM's own /metrics

RESULT 97.8% cache-hit rate 1,611,008 / 1,647,574 prompt tokens — claim was 90%

Pipeline 2 · prove the agent workflow

SANDBOX C coding agent + failing repo TypeError: naive vs aware datetime comparison

pytest red → 2-line fix → pytest green

GITHUB · MAIN fix merged (ffda3d7) verbatim before/after output in step 7 below

RESULT 0 → 2 tests passing execute → verify → publish, fully isolated

What happened: the harness in B drove the model in A through a public islo.dev URL, and vLLM's own counters measured the cache. The agent in C turned a failing repo green. Every box maps to a numbered step below with its captured output.

97.8% REAL CACHE-HIT RATE (vLLM /metrics)

1.65M PROMPT TOKENS THROUGH vLLM

138 SUCCESSFUL MODEL REQUESTS

0 → 2 TESTS PASSING AFTER AGENT FIX

Spin up a vLLM sandbox

One CLI command turns the official vLLM CPU image into a running sandbox. Note the engine config: enable_prefix_caching=True .

$ islo use vllm-cpu-1781079422 -i docker.io/vllm/vllm-openai-cpu:latest-x86_64 \ --cpu 6 --memory 12288 --disk 30 ✓ Sandbox 'vllm-cpu-1781079422' created $ vllm serve Qwen/Qwen2.5-0.5B-Instruct --dtype bfloat16 --max-model-len 32768 (EngineCore) INFO core.py:112 Initializing a V1 LLM engine (v0.22.1) with config: model='Qwen/Qwen2.5-0.5B-Instruct',...

Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

Apple WWDC 2026 Livestream

Claude Fable 5

It's Not Just X. It's Y

Show HN: GoPeek – open links in live mini browser windows without new tabs