Codec — Token-native binary transport for AI APIs
v0.4.1 shipping · source-available
The control plane for AI inference.<br>AI inference is burning megawatts of GPU power, and datacenter buildout is racing to keep up. Meanwhile, your inference stack pays again at every hop, on top of the GPU bill. Models think in tokens, but the rest of the stack speaks text. Every gateway, router, tool dispatcher, and middleware in the path performs the same ritual: detokenize the model's IDs to text, encode as UTF-8, wrap in JSON, ship it, parse it, decode UTF-8, and re-tokenize back to IDs. That burns CPU, memory, and latency on lossy conversions the model never asked for, and it risks KV-cache corruption when the re-tokenize doesn't round-trip cleanly.<br>Codec is a drop-in upgrade that keeps token IDs as the wire format end-to-end: gateways forward IDs verbatim, tool dispatchers match on raw IDs, and cross-model handoffs translate vocabularies in-process. Same model, same prompts, same answers; typically 16× less data on the wire on real agent traffic, and up to ~1,700× when the content compresses well. How big the win is depends on what your AI generates; full receipts below.<br>On mobile: snappier app, lighter cloud bill. At fleet scale: megawatt-hours of network energy and middleware CPU not burned on bytes nobody reads. Plug-in libraries for TypeScript, Python, Rust, Java, .NET, and C work with the AI servers you already use (sglang, vllm, llama.cpp). Your code doesn't change. We can't make the model smaller, but we can cut the waste. And by shrinking the wire by up to 1,000+×, Codec opens AI access to the ~5 billion people on slow, expensive, or metered connections whom JSON-SSE prices out.
github / wdunn001/Codec<br>~$400M+/yr total wire + GPU savings worldwide: ~$320M cloud egress (heavy-agent baseline; tool-use + A2A is the default at Claude/ChatGPT/Gemini) + ~$50M GPU on blocked prompts + ~$35M Starlink; sub-agent-heavy flows push this to $500–700M/yr<br>up to 10× faster on mobile: 2 K-token reply over 10 Mbps 4G<br>~400 cars/yr off the road today, ~4,000 by 2030: counts bidirectional traffic + ~8 round-trips per visible reply (the heavy-agent topology every major provider runs today) + ~10% doomed prompts blocked client-side<br>~5B people: AI accessible where it wasn't. 2.6B offline + 2–3B on slow / expensive mobile (ITU 2024)<br>JSON: ~1 byte / character, 142 bytes<br>Codec: 8 token IDs · 4 bytes each (uint32), 32 bytes<br>4× shorter on the wire<br>Token IDs straight on the wire. Tool-call dispatch, observability, cross-vocab handoff: all the things you'd want to do at the inference layer reduce to integer compares on the stream. Detokenize becomes a byproduct, not a per-token cost.
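The size arithmetic above (8 token IDs at 4 bytes each is 32 bytes) can be sketched in a few lines of Python. This is an illustration only, not Codec's actual framing: the little-endian byte order, the helper names `pack_ids` / `unpack_ids`, and the sample token IDs are all assumptions made for the example.

```python
import struct

# Hypothetical token IDs; real values depend on the model's tokenizer.
token_ids = [9906, 1917, 11, 420, 374, 264, 1296, 13]

def pack_ids(ids):
    """Pack token IDs as uint32 values, 4 bytes each.

    Little-endian order is an assumption for this sketch, not a
    statement about Codec's wire format.
    """
    return struct.pack(f"<{len(ids)}I", *ids)

def unpack_ids(buf):
    """Inverse of pack_ids; the buffer length must be a multiple of 4."""
    if len(buf) % 4:
        raise ValueError("buffer length is not a multiple of 4")
    return list(struct.unpack(f"<{len(buf) // 4}I", buf))

wire = pack_ids(token_ids)
assert len(wire) == 8 * 4           # 32 bytes for 8 IDs
assert unpack_ids(wire) == token_ids  # the round trip is lossless
```

A fixed 4-byte width means a receiver can index, compare, or forward IDs without parsing text; any compression layer would sit on top of a buffer like this.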
control-plane primitives<br>Three operations. All on raw token IDs.
Codec gives the inference layer the same primitives a service mesh gives<br>a microservice fleet: route, dispatch, translate.<br>Run them on raw uint32 tokens, never on text. The<br>compression you see in the receipts below is what falls out for free<br>when you stop reserializing every hop.
Models think in tokens. Every piece of middleware in your stack (gateway, router, log sink) speaks text, so it detokenizes, JSON-wraps, ships, parses, and re-tokenizes once per hop, burning CPU and risking KV-cache drift. Codec keeps token IDs as the wire format end-to-end; UTF-8 happens once, at the edge that actually displays text. Same compression options on top (gzip / brotli / dict-zstd). Same framing on every engine; six client languages decode byte-identically.
16–1,700× less wire (workload-dependent)<br>3 engines, one wire
The MCP path normally tokenizes the tool result at the gateway, on every call. A Codec-aware MCP server (codec-time-leaf) attaches token IDs to its result via _meta['ai.codec/leaf-tokenization']; the codec-metamcp gateway forwards them verbatim: [Codec][leaf] fires, the gateway becomes a transparent ID pipe, and the consumer skips its BPE re-tokenize.<br>tools/list across a 40-tool namespace: 21.4 KB → 5.9 KB (3.6×).<br>ToolWatcher detects tool boundaries on the raw ID stream at 26.7× the speed of detokenize+regex (lab EPYC, 481 Mtok/s).
12.4× leaf consumer CPU<br>26.7× ToolWatcher vs detok+regex
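To make "detects tool boundaries on the raw ID stream" concrete, here is a minimal sketch of matching a marker sequence with integer compares only, with no detokenize and no regex. It is not Codec's ToolWatcher: the class name `BoundaryWatcher`, the marker ID 128010, and the sample stream are hypothetical and chosen purely for illustration.

```python
from collections import deque

class BoundaryWatcher:
    """Flags positions in a token-ID stream where a marker sequence ends."""

    def __init__(self, marker):
        if not marker:
            raise ValueError("marker must be non-empty")
        self.marker = tuple(marker)
        # Sliding window holding the most recent len(marker) IDs.
        self.window = deque(maxlen=len(self.marker))

    def feed(self, token_id):
        """Consume one ID; return True if the marker ends at this ID."""
        self.window.append(token_id)
        return tuple(self.window) == self.marker

# Hypothetical: 128010 stands in for a model's tool-call open marker.
watcher = BoundaryWatcher([128010])
stream = [791, 9282, 128010, 456, 1712]
hits = [i for i, tid in enumerate(stream) if watcher.feed(tid)]
assert hits == [2]  # the marker is the third ID in the stream
```

The same watcher handles multi-token markers, since the window compares the last N IDs against the marker tuple on every step.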
A Llama-3 agent's stream feeds a Qwen-2 agent through one<br>in-process detokenize / retokenize step. UTF-8 never crosses<br>the wire. At 2 K tokens the Codec path ships<br>15.1× fewer wire bytes<br>(10.4 KB → 709 B) at bridge CPU within noise of<br>the JSON-SSE+retokenize path. Both paths emit byte-identical<br>Qwen-2 output; the bench asserts strict equality before<br>reporting numbers.
15.1× smaller wire @ 2K<br>≡ byte-identical output
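The cross-vocabulary handoff described above (detokenize with one model's vocabulary, retokenize with another's, all inside one process) can be sketched with toy vocabularies. Everything here is an assumption for illustration: the two tiny vocabularies, the greedy longest-match tokenizer standing in for real BPE, and the function names. Real Llama-3 and Qwen-2 tokenizers are far larger and behave differently.

```python
# Toy vocabularies, invented for this sketch.
SRC_VOCAB = {0: "hello", 1: " ", 2: "world"}                     # ID -> text
DST_VOCAB = {"hel": 10, "lo": 11, " ": 12, "wor": 13, "ld": 14}  # text -> ID

def detokenize(ids, vocab):
    """Join the text pieces for each source ID."""
    return "".join(vocab[i] for i in ids)

def retokenize(text, vocab):
    """Greedy longest-match tokenization (a simple stand-in for BPE)."""
    pieces = sorted(vocab, key=len, reverse=True)
    out, pos = [], 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                out.append(vocab[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"no token covers text at offset {pos}")
    return out

def bridge(src_ids):
    """Translate source-vocab IDs to target-vocab IDs in-process.

    The intermediate text exists only in memory; only IDs go in and out.
    """
    return retokenize(detokenize(src_ids, SRC_VOCAB), DST_VOCAB)

dst_ids = bridge([0, 1, 2])
assert dst_ids == [10, 11, 12, 13, 14]
```

The point of the sketch is the shape of the step: IDs arrive, IDs leave, and the text form never needs to be serialized between the two agents.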
The same wire format extends to diffusion<br>models: VAE latents stream in length-prefixed binary frames<br>instead of decoded pixels. The client runs vae_decode<br>locally; pixels never touch the wire. Measured on the lab against<br>codec-diffusers running SD-1.5: a...