The AI Token plumbing issue

Lago Blog | AI Billing is (mostly) token plumbing

May 26 / Product / 5 min read

AI Billing is (mostly) token plumbing

Raffi Sarkissian

Why we built the Lago Agent SDK, and what we're shipping next.

We just released the Lago Agent SDK. Two libraries, Python and TypeScript. They wrap your LLM client and send token usage to Lago for billing. That's the surface. The point is what you stop doing.

The token plumbing

Every team that shipped an AI feature in the last 18 months built the same thing. Smart search, inbox triage, meeting summaries, coding agents, vibe-coded apps. All of them ended up writing token-extraction middleware.

The middleware is the same job, repeated everywhere. Call an LLM. Parse the response for token counts. Attribute the call to a customer. Send the count to a billing system. Repeat for every provider, every model family, every streaming response, every retry, every cached call.

Every provider returns usage in a different shape:

    openai_resp.usage.prompt_tokens
    anthropic_resp.usage.input_tokens     # plus cache_creation_input_tokens, cache_read_input_tokens
    bedrock_resp["usage"]["inputTokens"]  # camelCase, dict access, no cache fields at this level
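The per-provider extraction described above can be pictured as a small dispatch function. This is a minimal sketch, not Lago's implementation: `extract_input_tokens` and the provider labels are hypothetical names, and the responses are stand-in objects that mimic only the fields shown above.

```python
from types import SimpleNamespace


def extract_input_tokens(provider: str, resp) -> int:
    """Return the input-token count from a provider response.

    Hypothetical helper: it handles only the three access patterns
    named in the article.
    """
    if provider == "openai":
        return resp.usage.prompt_tokens       # attribute access
    if provider == "anthropic":
        return resp.usage.input_tokens        # attribute access, different name
    if provider == "bedrock":
        return resp["usage"]["inputTokens"]   # dict access, camelCase
    raise ValueError(f"unsupported provider: {provider}")


# Stand-in responses (assumption: real SDK objects expose the same fields).
openai_resp = SimpleNamespace(usage=SimpleNamespace(prompt_tokens=1200))
anthropic_resp = SimpleNamespace(usage=SimpleNamespace(input_tokens=1200))
bedrock_resp = {"usage": {"inputTokens": 1200}}

counts = [
    extract_input_tokens("openai", openai_resp),
    extract_input_tokens("anthropic", anthropic_resp),
    extract_input_tokens("bedrock", bedrock_resp),
]
```

Even this toy version needs one branch per provider for a single field, which is the maintenance cost the article is describing.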

Cache tokens have sub-types. Streaming responses bury usage in the last event, sometimes. Reasoning tokens are folded into output on some models, broken out on others. The schemas change every quarter.

This is the token plumbing. Not differentiating, not what your AI feature is for, and it breaks every time a provider ships an update.

Two audiences, same plumbing

The B2B SaaS team adding AI to an existing product. Intercom shipping Fin on top of seat-based pricing. Notion layering AI as a per-seat add-on. Atlassian Intelligence rolling out across Jira and Confluence. The team has billed per-seat for years and now needs to charge for inference-backed features without rewriting the engine. Product wants AI live in two weeks. Engineering owns a sidecar nobody wants to maintain. The CFO wants to know if the feature has positive margin. Nobody can answer cleanly because token data lives in logs, not invoices.

The AI-native team building on top of LLMs. Cursor, Lovable, Replit, voice and browser agents. They pay a per-token rate to a model provider and bill the user with margin on top. Cost-plus, end to end. Every point of margin matters because COGS is variable per-customer and tracked in real time. Under-count and they bleed margin. Over-count and they lose trust. The middleware has to be exact, every release, for every model they add.

Both groups built the same plumbing. We're tired of building it.

Wrap once

Before, billing an LLM call looked something like this.

    resp = client.converse(modelId="...", messages=[...])

    usage = resp["usage"]
    billing.send_event(customer_id, "llm_input_tokens", usage["inputTokens"])
    billing.send_event(customer_id, "llm_output_tokens", usage["outputTokens"])
    billing.send_event(customer_id, "llm_cache_read", usage.get("cacheReadInputTokens", 0))
    # ... repeat for cache writes, tool calls, reasoning tokens, streaming chunks
    # ... then write it all again, differently, for the next provider you add

After, you wrap the client once.

    # OpenAI
    client = sdk.wrap(OpenAI())
    client.chat.completions.create(model="gpt-4o", messages=[...])

    # Anthropic
    client = sdk.wrap(Anthropic())
    client.messages.create(model="claude-sonnet-4-5", messages=[...])

    # Bedrock
    client = sdk.wrap(boto3.client("bedrock-runtime"))
    client.converse(modelId="...", messages=[...])

    # token attribution happens automatically, per customer, across every provider
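To make the idea of wrapping concrete, here is a minimal sketch of how a wrapper could intercept a call and report usage without changing what the caller sees. It is not the Lago Agent SDK's implementation: `wrap_call`, `current_customer`, and `FakeClient` are hypothetical names, and carrying the customer in a context variable is an assumption about how attribution might work.

```python
import contextvars
import functools
from typing import Any, Callable

# Assumption: the active customer is carried in a context variable.
current_customer = contextvars.ContextVar("current_customer")


def wrap_call(fn: Callable[..., Any],
              report: Callable[[str, dict], None]) -> Callable[..., Any]:
    """Return a callable that behaves like `fn` but also reports usage."""

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        resp = fn(*args, **kwargs)  # exceptions from fn propagate unchanged
        usage = resp.get("usage") if isinstance(resp, dict) else None
        if usage is not None:
            report(current_customer.get(), dict(usage))
        return resp  # the caller receives the original response object

    return wrapper


class FakeClient:
    """Stand-in for a Bedrock-style client; returns a canned response."""

    def converse(self, modelId: str, messages: list) -> dict:
        return {"output": "ok",
                "usage": {"inputTokens": 1200, "outputTokens": 340}}


reported: list = []
client = FakeClient()
# Replace the method on this instance with its usage-reporting wrapper.
client.converse = wrap_call(
    client.converse, lambda customer, usage: reported.append((customer, usage))
)

current_customer.set("sub_acme")
response = client.converse(modelId="example-model", messages=[])
```

The sketch handles one dict-shaped response only; streaming, retries, and attribute-style responses would each need their own handling.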

What lands in billing tells the story.

Old world. Anthropic returns one shape:

    {
      "model": "claude-sonnet-4-5",
      "usage": {
        "input_tokens": 1200,
        "output_tokens": 340,
        "cache_creation_input_tokens": 800,
        "cache_read_input_tokens": 4000
      }
    }

OpenAI returns another:

    {
      "model": "gpt-4o",
      "usage": {
        "prompt_tokens": 1200,
        "completion_tokens": 340,
        "prompt_tokens_details": { "cached_tokens": 4000 }
      }
    }

Different field names, different nesting, different cache semantics. You write one extractor per provider, map the fields, send one event per dimension. Then a model adds a new field and you do it again.

New world. The SDK normalizes both into the same canonical shape and batches them to Lago:

    {
      "external_subscription_id": "sub_acme",
      "events": [
        { "code": "llm_input_tokens",          "properties": { "value": 1200 } },
        { "code": "llm_output_tokens",         "properties": { "value": 340 } },
        { "code": "llm_cached_input_tokens",   "properties": { "value": 4000 } },
        { "code": "llm_cache_creation_tokens", "properties": { "value": 800 } }
      ]
    }
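The mapping from provider payloads to the canonical events above can be sketched as a lookup table per provider. This is an illustration under assumptions, not the SDK's code: `FIELD_MAP`, `_dig`, and `to_events` are hypothetical names, the field paths are taken from the two payloads shown above, and the sketch copies values as-is without reconciling the providers' different cache semantics.

```python
from typing import Any, Optional

# Canonical event code -> path into the provider's "usage" object.
FIELD_MAP = {
    "anthropic": {
        "llm_input_tokens": ("input_tokens",),
        "llm_output_tokens": ("output_tokens",),
        "llm_cached_input_tokens": ("cache_read_input_tokens",),
        "llm_cache_creation_tokens": ("cache_creation_input_tokens",),
    },
    "openai": {
        "llm_input_tokens": ("prompt_tokens",),
        "llm_output_tokens": ("completion_tokens",),
        "llm_cached_input_tokens": ("prompt_tokens_details", "cached_tokens"),
    },
}


def _dig(obj: Any, path: tuple) -> Optional[int]:
    """Follow `path` through nested dicts; return None if a key is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def to_events(provider: str, usage: dict, subscription_id: str) -> dict:
    """Build one batch in the canonical shape; absent fields yield no event."""
    events = []
    for code, path in FIELD_MAP[provider].items():
        value = _dig(usage, path)
        if value is not None:
            events.append({"code": code, "properties": {"value": value}})
    return {"external_subscription_id": subscription_id, "events": events}


anthropic_usage = {
    "input_tokens": 1200,
    "output_tokens": 340,
    "cache_creation_input_tokens": 800,
    "cache_read_input_tokens": 4000,
}
openai_usage = {
    "prompt_tokens": 1200,
    "completion_tokens": 340,
    "prompt_tokens_details": {"cached_tokens": 4000},
}

anthropic_batch = to_events("anthropic", anthropic_usage, "sub_acme")
openai_batch = to_events("openai", openai_usage, "sub_acme")
```

With these inputs, both batches share the input, output, and cached-input events, and only the Anthropic batch carries a cache-creation event, since the OpenAI payload has no matching field.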

Same event shape regardless of provider. Customer attribution is automatic. Cache fields populate when the provider returns them, stay absent when it doesn't.

The wrapped client behaves identically to the original. Same arguments, same return shape, same exceptions. The SDK extracts usage from every response, normalizes it across providers, attributes it to a customer...
