Netflix Wiz creates app to slash AI bills, then open sources it

AI + ML

Project Headroom could save you big money, too

Joab Jackson

Published Sun 31 May 2026 // 08:00 UTC

As the COOs from both Uber and Microsoft recently learned, encouraging company engineers to use AI aggressively can lead to hefty usage bills, perhaps even offsetting all the gains from laying off employees. The AI bills at Netflix may not be so eye-popping thanks to company senior engineer Tejas Chopra, who has created software to prune agent instructions, as measured in tokens, before they hit the LLM. Chopra has estimated that as much as 90% of tokens are redundant to the giant thinking machine of your choice.

Although not an official Netflix project, several teams there already use Project Headroom, and a number of external projects rely on it as well.

In a talk at the Open Source Summit last week, Chopra said that Headroom has saved an estimated $700,000 for its users, who collectively now have 200 billion tokens to spend elsewhere. Not bad for an open source application that’s been out only since January. Headroom, currently at a still-raw v0.22, has gathered 2,000 stars on GitHub and has been forked over 120 times.

“A lot of our users are people who have been really burned by token costs, more than anything else,” Chopra said in his presentation.

Lossless context compression

A $287 bill from Claude Sonnet first brought Chopra’s attention to the idea of token economization. The bill was typical home project stuff: a bit of debugging, some refactoring, MCP tools querying a database.

At the time, Claude Sonnet’s token-based pricing seemed pretty generous: $3 for every million input tokens, or $6 per million if you went over the 200,000-token limit for your context window. Still, that $287 added up quickly.

Upon deeper inspection, Chopra found that much of this data was highly redundant to the LLM. By and large, his own hand-crafted instructions were not the culprit. Rather, it was all the boilerplate and machine metadata that came along for the ride: needlessly verbose JSON schemas, nested templates within API responses, identical database columns.

“This isn’t prose. This isn’t creative writing. This is compressible data masquerading as text,” Chopra wrote in a blog post introducing his software.

In 2025, a group of researchers found that reading user input accounted for about 76% of all token consumption.
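The pricing quoted above lends itself to a quick back-of-the-envelope estimate. The Python sketch below is illustrative only: the function names are invented for this example, and it assumes that a request crossing the 200,000-token threshold is billed entirely at the higher rate, which the article does not spell out.

```python
BASE_RATE_USD = 3.0               # per million input tokens, as quoted above
LONG_RATE_USD = 6.0               # per million once a request exceeds the threshold
LONG_CONTEXT_THRESHOLD = 200_000  # tokens


def request_cost_usd(input_tokens: int) -> float:
    """Input cost of a single request.

    Assumption: the whole request is billed at the higher rate
    once it crosses the threshold.
    """
    rate = LONG_RATE_USD if input_tokens > LONG_CONTEXT_THRESHOLD else BASE_RATE_USD
    return input_tokens * rate / 1_000_000


def tokens_for_budget(budget_usd: float, rate_usd: float = BASE_RATE_USD) -> float:
    """How many input tokens a given spend buys at a flat per-million rate."""
    return budget_usd / rate_usd * 1_000_000


def cost_after_pruning(cost_usd: float, redundant_fraction: float) -> float:
    """Cost left over if a fraction of the tokens is removed before sending."""
    if not 0.0 <= redundant_fraction <= 1.0:
        raise ValueError("redundant_fraction must be between 0 and 1")
    return cost_usd * (1.0 - redundant_fraction)


if __name__ == "__main__":
    print(f"100k-token request: ${request_cost_usd(100_000):.2f}")
    print(f"250k-token request: ${request_cost_usd(250_000):.2f}")
    print(f"Tokens behind a $287 bill: {tokens_for_budget(287):,.0f}")
    print(f"Same bill with 90% pruned: ${cost_after_pruning(287, 0.9):.2f}")
```

At the flat $3 rate, a $287 bill works out to roughly 96 million input tokens, and 200 billion tokens come to about $600,000, the same order of magnitude as the $700,000 savings estimate Chopra cited.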

The model providers have their own tools to save tokens. But to date, the settings on these tools are somewhat opaque to end users.

By default, Claude has a prefix cache setting of just five minutes. After five minutes of inactivity, the entire context window needs to be refreshed, even if the LLM needs the exact same data. Another setting is exposed in the API documentation: a one-hour time to live (TTL). But there is a catch.

“You pay two times the cost for your writes to get 90% savings for your reads,” Chopra told the audience. It’s up to you to find the sweet spot.

There are also a number of new commercial token barbers popping up, such as Y Combinator-funded Token Company, which offers token compression as a service. On the open source side there is RTK (Rust Token Killer), which trims the output of verbose commands, such as calls to a repository. Another open source project, LeanCTX, is a variant of RTK.

All these tools are useful, Chopra admitted, but he designed Headroom to keep the operations confined to the developer’s workflow. And it has something none of the other apps and services could offer: reversible compression.

Headroom’s job is to compress all the source material that is fed into the user’s context window (not only the conversation history, but also logs, tool outputs, files, and chunks of documentation that the RAG found useful) before it arrives at the LLM.

The context window is the set space for each user session. The latest frontier models are rapidly expanding their context windows upwards towards two million tokens, a space that holds both input and output. Such generosity is a mixed blessing, as Pope Leo might point out.

As a unit of measurement, a single token is more or less equivalent to a human word. For pay-as-you-go plans, the more you feed the context window, the more you’ll pay.

Gobbling tokens like Pac-Man

Running on Python and Node, Headroom runs as a proxy (port 8787) on the engineer’s computer. The user wraps their LLM tool at the command line interface (e.g. “headroom wrap codex”) and Headroom then parses the input.
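The "sweet spot" Chopra mentions for the one-hour cache setting can be estimated with a little arithmetic. The sketch below is a rough model, not a provider's billing formula: it takes the multipliers from his quote (writes at twice the base price, reads at a 90% discount), assumes every follow-up request lands inside the cache's time to live, and uses invented function names.

```python
def uncached_cost(prefix_tokens: int, requests: int, price_per_million: float = 3.0) -> float:
    """Cost of resending the same prefix at full price on every request."""
    return requests * prefix_tokens * price_per_million / 1_000_000


def cached_cost(
    prefix_tokens: int,
    requests: int,
    price_per_million: float = 3.0,
    write_multiplier: float = 2.0,   # "two times the cost for your writes"
    read_multiplier: float = 0.1,    # "90% savings for your reads"
) -> float:
    """Cost with one cache write followed by cache reads.

    Assumption: all requests after the first hit the cache before it expires.
    """
    if requests <= 0:
        return 0.0
    base = prefix_tokens * price_per_million / 1_000_000
    return base * (write_multiplier + read_multiplier * (requests - 1))


def break_even_requests(write_multiplier: float = 2.0, read_multiplier: float = 0.1) -> int:
    """Smallest number of requests at which caching becomes cheaper."""
    if read_multiplier >= 1.0:
        raise ValueError("cache reads must be cheaper than uncached input")
    n = 1
    # Compare cost in units of one full-price prefix.
    while write_multiplier + read_multiplier * (n - 1) >= n:
        n += 1
    return n


if __name__ == "__main__":
    print("Break-even requests:", break_even_requests())
    for n in (1, 2, 3, 10):
        print(n, round(uncached_cost(100_000, n), 2), round(cached_cost(100_000, n), 2))
```

Under these assumptions the cache pays for itself from the third request that reuses the same prefix within the hour; with only one or two requests, the doubled write price makes caching the more expensive option.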

While Headroom does compress a bit of programming code and human instruction, it is best at chopping server logs (90% of which can be jettisoned), MCP tool outputs (70% redundant JSON), database outputs (it’s all one schema), and file trees (much repeated metadata). Headroom’s first step is a process called CacheAligner, which looks only for information...
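To make the "it’s all one schema" point concrete, here is a small Python sketch of one reversible transform for database-style output: the repeated column names are stored once, and the original rows can be rebuilt exactly. This is not Headroom's algorithm, which the article does not describe in detail; the function names are hypothetical, and character counts stand in as a rough proxy for tokens.

```python
import json


def pack_rows(rows: list[dict]) -> dict:
    """Store the shared column names once instead of repeating them per row."""
    if not rows:
        return {"columns": [], "values": []}
    columns = list(rows[0].keys())
    for row in rows:
        if list(row.keys()) != columns:
            raise ValueError("rows do not share a single schema")
    return {"columns": columns, "values": [[row[c] for c in columns] for row in rows]}


def unpack_rows(packed: dict) -> list[dict]:
    """Rebuild the original list of row dictionaries from the packed form."""
    return [dict(zip(packed["columns"], values)) for values in packed["values"]]


if __name__ == "__main__":
    # Made-up sample data for illustration.
    rows = [
        {"customer_id": i, "order_status": "shipped", "created_at": "2026-05-01"}
        for i in range(50)
    ]
    packed = pack_rows(rows)
    print("original chars:", len(json.dumps(rows)))
    print("packed chars:  ", len(json.dumps(packed)))
    print("round trip ok: ", unpack_rows(packed) == rows)
```

Because the packed form keeps every value and the column order, nothing is lost: unpacking returns the same rows, which is the property that distinguishes reversible compression from simply truncating the output.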
