Lago Blog | AI Billing is (mostly) token plumbing
May 26 / Product / 5 min read
AI Billing is (mostly) token plumbing
Raffi Sarkissian
Why we built the Lago Agent SDK, and what we're shipping next.

We just released the Lago Agent SDK. Two libraries, Python and TypeScript. They wrap your LLM client and send token usage to Lago for billing. That's the surface.

The point is what you stop doing.

The token plumbing

Every team that shipped an AI feature in the last 18 months built the same thing. Smart search, inbox triage, meeting summaries, coding agents, vibe-coded apps. All of them ended up writing token-extraction middleware.

The middleware is the same job, repeated everywhere. Call an LLM. Parse the response for token counts. Attribute the call to a customer. Send the count to a billing system. Repeat for every provider, every model family, every streaming response, every retry, every cached call.

Every provider returns usage in a different shape:

openai_resp.usage.prompt_tokens
anthropic_resp.usage.input_tokens  # plus cache_creation_input_tokens, cache_read_input_tokens
bedrock_resp["usage"]["inputTokens"]  # camelCase, dict access, cache fields optional
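One way to picture the per-provider extractors described above is a small set of mapping functions. The sketch below is illustrative only: the `Usage` record and the `from_*` helpers are hypothetical names, not part of the Lago Agent SDK, and the payloads are treated as plain dicts rather than the client libraries' response objects. It copies fields through as reported and does not try to reconcile how each provider counts cached tokens.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Usage:
    """Hypothetical canonical usage record (not the SDK's actual type)."""
    input_tokens: int
    output_tokens: int
    cached_input_tokens: Optional[int] = None    # None means "not reported"
    cache_creation_tokens: Optional[int] = None  # None means "not reported"


def from_openai(usage: Mapping[str, Any]) -> Usage:
    # Cached tokens are nested one level down, under prompt_tokens_details.
    details = usage.get("prompt_tokens_details") or {}
    return Usage(
        input_tokens=usage["prompt_tokens"],
        output_tokens=usage["completion_tokens"],
        cached_input_tokens=details.get("cached_tokens"),
    )


def from_anthropic(usage: Mapping[str, Any]) -> Usage:
    # Cache reads and cache writes are separate top-level fields.
    return Usage(
        input_tokens=usage["input_tokens"],
        output_tokens=usage["output_tokens"],
        cached_input_tokens=usage.get("cache_read_input_tokens"),
        cache_creation_tokens=usage.get("cache_creation_input_tokens"),
    )


def from_bedrock(usage: Mapping[str, Any]) -> Usage:
    # camelCase keys; only the fields that appear in this post are mapped.
    return Usage(
        input_tokens=usage["inputTokens"],
        output_tokens=usage["outputTokens"],
        cached_input_tokens=usage.get("cacheReadInputTokens"),
    )
```

Even at this size, the three functions differ in key casing, nesting, and which cache fields exist, which is the maintenance burden the post is describing.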
Cache tokens have sub-types. Streaming responses bury usage in the last event, sometimes. Reasoning tokens are folded into output on some models, broken out on others. The schemas change every quarter.

This is the token plumbing. Not differentiating, not what your AI feature is for, and it breaks every time a provider ships an update.

Two audiences, same plumbing

The B2B SaaS team adding AI to an existing product. Intercom shipping Fin on top of seat-based pricing. Notion layering AI as a per-seat add-on. Atlassian Intelligence rolling out across Jira and Confluence. The team has billed per-seat for years and now needs to charge for inference-backed features without rewriting the engine. Product wants AI live in two weeks. Engineering owns a sidecar nobody wants to maintain. The CFO wants to know if the feature has positive margin. Nobody can answer cleanly because token data lives in logs, not invoices.

The AI-native team building on top of LLMs. Cursor, Lovable, Replit, voice and browser agents. They pay a per-token rate to a model provider and bill the user with margin on top. Cost-plus, end to end. Every point of margin matters because COGS is variable per-customer and tracked in real time. Under-count and they bleed margin. Over-count and they lose trust. The middleware has to be exact, every release, for every model they add.

Both groups built the same plumbing. We're tired of building it.

Wrap once

Before, billing an LLM call looked something like this.

resp = client.converse(modelId="...", messages=[...])
usage = resp["usage"]
billing.send_event(customer_id, "llm_input_tokens", usage["inputTokens"])
billing.send_event(customer_id, "llm_output_tokens", usage["outputTokens"])
billing.send_event(customer_id, "llm_cache_read", usage.get("cacheReadInputTokens", 0))
# ... repeat for cache writes, tool calls, reasoning tokens, streaming chunks
# ... then write it all again, differently, for the next provider you add
After, you wrap the client once.

import boto3
from anthropic import Anthropic
from openai import OpenAI

# `sdk` stands for the Lago Agent SDK object; its setup is omitted here.

# OpenAI
client = sdk.wrap(OpenAI())
client.chat.completions.create(model="gpt-4o", messages=[...])

# Anthropic
client = sdk.wrap(Anthropic())
client.messages.create(model="claude-sonnet-4-5", messages=[...])

# Bedrock
client = sdk.wrap(boto3.client("bedrock-runtime"))
client.converse(modelId="...", messages=[...])

# token attribution happens automatically, per customer, across every provider
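To make the wrapping idea concrete, here is a minimal sketch of a transparent proxy. It is not the Lago Agent SDK's implementation: `UsageRecordingProxy`, `_find_usage`, and the fake client are hypothetical names made up for illustration, and the sketch assumes a non-streaming response that exposes `usage` either as an attribute or as a dict key.

```python
from types import SimpleNamespace
from typing import Any, Callable


def _find_usage(result: Any) -> Any:
    """Return the response's usage object, or None if it has none."""
    if isinstance(result, dict):
        return result.get("usage")
    return getattr(result, "usage", None)


class UsageRecordingProxy:
    """Forwards attribute access to the wrapped client and observes call results."""

    def __init__(self, target: Any, on_usage: Callable[[Any], None]) -> None:
        self._target = target
        self._on_usage = on_usage

    def __getattr__(self, name: str) -> Any:
        # Only reached for names not set on the proxy itself.
        attr = getattr(self._target, name)
        if callable(attr):
            def call(*args: Any, **kwargs: Any) -> Any:
                result = attr(*args, **kwargs)  # exceptions propagate unchanged
                usage = _find_usage(result)
                if usage is not None:
                    self._on_usage(usage)
                return result  # the caller gets the original response object
            return call
        if hasattr(attr, "__dict__"):
            # Nested namespaces (e.g. client.chat.completions) are proxied too.
            return UsageRecordingProxy(attr, self._on_usage)
        return attr


# A stand-in client so the sketch runs without network access or API keys.
def fake_create(model: str, messages: list) -> SimpleNamespace:
    usage = SimpleNamespace(prompt_tokens=12, completion_tokens=3)
    return SimpleNamespace(model=model, usage=usage)


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=fake_create))
)

seen = []  # stands in for "send to billing"
client = UsageRecordingProxy(fake_client, seen.append)
resp = client.chat.completions.create(model="gpt-4o", messages=[])
```

The call site is unchanged and the response comes back untouched; the only side effect is that `seen` now holds the usage object. Streaming, retries, and customer attribution are left out of this sketch.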
What lands in billing tells the story.

Old world. Anthropic returns one shape:

{
  "model": "claude-sonnet-4-5",
  "usage": {
    "input_tokens": 1200,
    "output_tokens": 340,
    "cache_creation_input_tokens": 800,
    "cache_read_input_tokens": 4000
  }
}
OpenAI returns another:

{
  "model": "gpt-4o",
  "usage": {
    "prompt_tokens": 1200,
    "completion_tokens": 340,
    "prompt_tokens_details": { "cached_tokens": 4000 }
  }
}
Different field names, different nesting, different cache semantics. You write one extractor per provider, map the fields, send one event per dimension. Then a model adds a new field and you do it again.

New world. The SDK normalizes both into the same canonical shape and batches them to Lago:

{
  "external_subscription_id": "sub_acme",
  "events": [
    { "code": "llm_input_tokens", "properties": { "value": 1200 } },
    { "code": "llm_output_tokens", "properties": { "value": 340 } },
    { "code": "llm_cached_input_tokens", "properties": { "value": 4000 } },
    { "code": "llm_cache_creation_tokens", "properties": { "value": 800 } }
  ]
}
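As a rough illustration of how a batch in this shape could be assembled from already-normalized counts, consider the sketch below. The event codes are the ones shown in the batch above; the `build_batch` helper, the field names on the input mapping, and the choice to skip unreported dimensions are assumptions made for this example, not the SDK's documented behavior.

```python
from typing import Any, Dict, Mapping, Optional

# Event codes copied from the batch above. The left-hand field names are
# hypothetical and exist only for this sketch.
EVENT_CODES = {
    "input_tokens": "llm_input_tokens",
    "output_tokens": "llm_output_tokens",
    "cached_input_tokens": "llm_cached_input_tokens",
    "cache_creation_tokens": "llm_cache_creation_tokens",
}


def build_batch(
    subscription_id: str, usage: Mapping[str, Optional[int]]
) -> Dict[str, Any]:
    """Build one event per reported dimension; missing or None fields are skipped."""
    events = [
        {"code": code, "properties": {"value": usage[field]}}
        for field, code in EVENT_CODES.items()
        if usage.get(field) is not None
    ]
    return {"external_subscription_id": subscription_id, "events": events}


# A provider that reported cache reads but no cache writes.
batch = build_batch(
    "sub_acme",
    {"input_tokens": 1200, "output_tokens": 340, "cached_input_tokens": 4000},
)
```

Here `batch["events"]` contains three events and no `llm_cache_creation_tokens` entry, since that count was never reported.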
Same event shape regardless of provider. Customer attribution is automatic. Cache fields populate when the provider returns them, stay absent when it doesn't.

The wrapped client behaves identically to the original. Same arguments, same return shape, same exceptions. The SDK extracts usage from every response, normalizes it across providers, attributes it to a customer...