A cheap fix that saves AI $400M a year and brings 4B people online

Codec — Token-native binary transport for AI APIs

v0.4.1 shipping · source-available

The control plane for AI inference.

AI inference is burning megawatts of GPU power, and datacenter buildout is racing to keep up. Meanwhile, on top of the GPU bill, your inference stack pays again at every hop. Models think in tokens, but the rest of the stack speaks text. Every gateway, router, tool dispatcher, and middleware in the path performs the same ritual: detokenize the model's IDs to text, encode as UTF-8, wrap in JSON, ship it, parse it, decode UTF-8, and re-tokenize back to IDs. That burns CPU, memory, and latency on lossy conversions the model never asked for, and it risks KV-cache corruption when the re-tokenization doesn't round-trip cleanly.

Codec is a drop-in upgrade that keeps token IDs as the wire format end-to-end: gateways forward IDs verbatim, tool dispatchers match on raw IDs, and cross-model handoffs translate vocabularies in-process. Same model, same prompts, same answers; typically 16× less data on the wire on real agent traffic, and up to ~1,700× when the content compresses well. How big the win is depends on what your AI generates; full receipts below.

On mobile: snappier app, lighter cloud bill. At fleet scale: megawatt-hours of network energy and middleware CPU not burned on bytes nobody reads. Plug-in libraries for TypeScript, Python, Rust, Java, .NET, and C work with the AI servers you already use (SGLang, vLLM, llama.cpp). Your code doesn't change. We can't make the model smaller, but we can shrink the waste. And by shrinking the wire by as much as 1,000× or more, Codec opens AI access to the ~5 billion people on slow, expensive, or metered connections whom JSON-SSE prices out.

GitHub: wdunn001/Codec

~$400M+/yr total wire + GPU savings worldwide: ~$320M cloud egress (heavy-agent baseline; tool-use + A2A is default at Claude/ChatGPT/Gemini) + ~$50M GPU on blocked prompts + ~$35M Starlink; sub-agent-heavy flows push to $500–700M/yr.

Up to 10× faster on mobile: 2 K-token reply over 10 Mbps 4G.

~400 cars/yr off the road today, ~4,000 by 2030: counts bidirectional + ~8 round-trips per visible reply (the heavy-agent topology every major provider runs today) + ~10% client-side blocked doomed prompts.

~5B people, AI accessible where it wasn't: 2.6B offline + 2–3B on slow / expensive mobile (ITU 2024).

JSON: ~1 byte / character, 142 bytes.
Codec: 8 token IDs · 4 bytes each (uint32), 32 bytes.
4× shorter on the wire.

Token IDs straight on the wire. Tool-call dispatch, observability, cross-vocab handoff: all the things you'd want to do at the inference layer reduce to integer compares on the stream. Detokenization becomes a byproduct, not a per-token cost.
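To make the 32-byte figure concrete (8 token IDs at 4 bytes each), here is a minimal sketch of packing token IDs as fixed-width uint32 values. The token IDs are made up for illustration, and the little-endian byte order is an assumption; the text above does not specify Codec's actual byte order or framing.

```python
import struct

# Hypothetical token IDs; real values depend on the model's vocabulary.
token_ids = [9906, 11, 1917, 0, 2650, 527, 499, 30]

# Assumed layout: each ID as an unsigned 32-bit little-endian integer.
packed = struct.pack(f"<{len(token_ids)}I", *token_ids)

# Decoding reverses the pack; the ID count is derived from the byte length.
decoded = list(struct.unpack(f"<{len(packed) // 4}I", packed))

print(len(packed))           # 8 IDs * 4 bytes each
print(decoded == token_ids)  # the round trip is lossless
```

Because every ID has the same width, a receiver can index into the buffer or compare IDs without parsing any text.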

Control-plane primitives

Three operations. All on raw token IDs.

Codec gives the inference layer the same primitives a service mesh gives a microservice fleet: route, dispatch, translate. Run them on raw uint32 tokens, never on text. The compression you see in the receipts below is what falls out for free when you stop reserializing every hop.

Models think in tokens. Every middleware in your stack — gateway, router, log sink — speaks text, so it detokenizes, JSON-wraps, ships, parses, re-tokenizes — once per hop, burning CPU and risking KV-cache drift. Codec keeps token IDs as the wire format end-to-end; UTF-8 happens once, at the edge that actually displays text. Same compression options on top (gzip / brotli / dict-zstd). Same framing on every engine; six client languages decode byte-identically.

16–1,700× less wire (workload-dependent) · 3 engines, one wire

The MCP path normally tokenizes the tool result at the gateway, on every call. A Codec-aware MCP server (codec-time-leaf) attaches token IDs to its result via _meta['ai.codec/leaf-tokenization']; the codec-metamcp gateway forwards them verbatim. [Codec][leaf] fires, the gateway becomes a transparent ID pipe, and the consumer skips its BPE re-tokenize. tools/list across a 40-tool namespace: 21.4 KB → 5.9 KB (3.6×). ToolWatcher detects tool boundaries on the raw ID stream at 26.7× the speed of detokenize+regex (lab EPYC, 481 Mtok/s).
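The idea of detecting tool boundaries on the raw ID stream can be sketched as a sliding-window match over integers. This is an illustration only, not ToolWatcher's implementation: the function name `find_marker` and the marker IDs are hypothetical, and a real detector would use the marker token sequence of the model's own vocabulary.

```python
from collections import deque
from typing import Iterable, Iterator, Sequence


def find_marker(ids: Iterable[int], marker: Sequence[int]) -> Iterator[int]:
    """Yield the stream index at which each occurrence of `marker` ends."""
    if not marker:
        raise ValueError("marker must contain at least one token ID")
    target = tuple(marker)
    # Fixed-size window holding the most recent len(marker) IDs.
    window = deque(maxlen=len(target))
    for i, tok in enumerate(ids):
        window.append(tok)
        # Pure integer comparison; no detokenization, no regex.
        if len(window) == len(target) and tuple(window) == target:
            yield i


# Hypothetical ID sequence standing in for a tool-call opening marker.
TOOL_OPEN = [27, 14506, 29]
stream = [1, 2, 27, 14506, 29, 5, 6, 27, 14506, 29]
hits = list(find_marker(stream, TOOL_OPEN))
print(hits)  # [4, 9]
```

The function consumes any iterable, so it can run over a live stream one ID at a time without buffering the whole response.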

12.4× leaf consumer CPU · 26.7× ToolWatcher vs detok+regex

A Llama-3 agent's stream feeds a Qwen-2 agent through one in-process detokenize / retokenize step. UTF-8 never crosses the wire. At 2 K tokens the Codec path ships 15.1× fewer wire bytes (10.4 KB → 709 B) at bridge CPU within noise of the JSON-SSE+retokenize path. Both paths emit byte-identical Qwen-2 output; the bench asserts strict equality before reporting numbers.
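The in-process detokenize / retokenize step can be sketched with two toy vocabularies. Everything below is an assumption made for illustration: the vocabularies, the IDs, and the greedy longest-match tokenizer are stand-ins, not the Llama-3 or Qwen-2 tokenizers, which use BPE and behave differently.

```python
from typing import Dict, List

# Toy vocabularies standing in for two real tokenizers (hypothetical IDs).
VOCAB_A: Dict[int, str] = {0: "hello", 1: " world", 2: "!"}
VOCAB_B: Dict[str, int] = {"hel": 10, "lo": 11, " ": 12, "world": 13, "!": 14}


def detokenize(ids: List[int], vocab: Dict[int, str]) -> str:
    """Map source-vocabulary IDs back to text."""
    return "".join(vocab[i] for i in ids)


def tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Greedy longest-match tokenizer; a simplification of real BPE."""
    pieces = sorted(vocab, key=len, reverse=True)
    out: List[int] = []
    pos = 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                out.append(vocab[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"no token covers text at offset {pos}")
    return out


def bridge(ids_a: List[int]) -> List[int]:
    """Translate IDs from vocabulary A to vocabulary B inside one process.

    The intermediate text lives only in memory; only IDs enter and leave.
    """
    return tokenize(detokenize(ids_a, VOCAB_A), VOCAB_B)


ids_b = bridge([0, 1, 2])
print(ids_b)  # [10, 11, 12, 13, 14]
```

The point of the sketch is the shape of the handoff: IDs in, IDs out, with text existing only transiently inside the bridge.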

15.1× smaller wire @ 2K · ≡ byte-identical output

The same wire format extends to diffusion models: VAE latents stream in length-prefixed binary frames instead of decoded pixels. The client runs vae_decode locally; pixels never touch the wire. Measured on the lab against codec-diffusers running SD-1.5: a...
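Length-prefixed binary framing, as mentioned above, can be sketched as follows. The 4-byte little-endian length prefix and the function names are assumptions for illustration; the actual Codec frame layout is not described in the text.

```python
import struct
from typing import Iterator


def encode_frame(payload: bytes) -> bytes:
    """Prefix the payload with its length as an assumed 4-byte LE uint32."""
    return struct.pack("<I", len(payload)) + payload


def decode_frames(buf: bytes) -> Iterator[bytes]:
    """Yield each complete frame payload found in `buf`, in order.

    Stops at the first incomplete frame so a caller can wait for more bytes.
    """
    pos = 0
    while pos + 4 <= len(buf):
        (length,) = struct.unpack_from("<I", buf, pos)
        end = pos + 4 + length
        if end > len(buf):
            break  # partial frame: header present but payload not yet complete
        yield buf[pos + 4:end]
        pos = end


# Opaque byte payloads standing in for latent chunks.
frames = [b"abc", b"", b"\x00\x01"]
wire = b"".join(encode_frame(f) for f in frames)
print(list(decode_frames(wire)) == frames)  # True
```

A length prefix lets the receiver split the stream without scanning for delimiters, which matters for binary payloads that may contain any byte value.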
