Lean Inference: Lean Manufacturing Principles Applied to AI

Lean Inference Workflows: Applying "Lean" Concepts To Building AI Agents

neurometric’s Substack


Making inference scale in a cost-effective way

Rob May Jun 03, 2026

Here’s a production scenario that should feel familiar: your agent hits a simple routing decision (does this user query need a database lookup or a calculator?) and fires off a GPT-4o call with a 12,000-token context stuffed with documentation it will never read, waits 4 seconds for a response, gets back malformed JSON, retries twice, and burns $0.40 to answer a question that a regex could have handled. Multiply that across 10,000 daily requests and you are spending $4,000 a day. Congratulations: you’ve built an inference money pit.

The AI engineering community collectively discovered that “just throw it at a frontier model” works great in demos and collapses in production. Agents enter retry death spirals. Context windows bloat with irrelevant RAG results. Sequential LLM calls stack latency until users abandon the workflow. The tools are extraordinarily powerful, and we are using them with the efficiency of a factory floor that nobody has ever walked with a stopwatch.

Lean Manufacturing fixed this problem for physical production decades ago. It’s time to apply the same discipline to inference. Lean Inference Workflows are the systematic application of Lean/TPS (Toyota Production System) principles to the design of LLM-powered agent architectures. Not as metaphor, but as engineering discipline.

The 7 Wastes of LLM Inference

Taiichi Ohno’s muda framework identified seven categories of waste in manufacturing. Each maps cleanly onto the failure modes we build into agents every day.

1. Overproduction: The Frontier Model Default

The most expensive waste is calling a 70B+ frontier model for tasks that don’t need it. Routing a support ticket to the right queue? That’s an 8B classification task. Extracting structured fields from a form submission? That’s a fine-tuned 3B model with a JSON schema. Summarizing a 500-word support thread? You don’t need GPT-4o.

The cost asymmetry is staggering. At list prices, Claude Sonnet runs roughly 3x or more the per-token cost of Haiku, and GPT-4o runs well over 10x the cost of GPT-4o-mini. When you reflexively reach for the frontier model on every step of a 15-step agent loop, you’re not just overspending; you’re adding latency at every node. If your task is a common one, you can even move to small language models (SLMs), which are faster and can be up to two orders of magnitude cheaper. Treat your agent’s model selection the same way a traffic engineer treats routing decisions: based on payload size, complexity score, and confidence threshold, not habit.

2. Inventory: RAG Bloat

Your vector database returns the top-20 chunks, and you shove all 20 into the context window “just in case.” That’s inventory waste: stockpiling inputs you probably won’t use, forcing the model to process them, inflating your input token count, and degrading retrieval precision in the process. More context isn’t better; it’s a longer assembly line with more defect opportunities. Controlled inventory means retrieving fewer, better chunks via re-ranking (a cross-encoder pass over your top-k candidates), then truncating aggressively before injection.

3. Waiting: Sequential Blocking

Tool calls that could run in parallel are running in series. You need to fetch a user’s account history, check their subscription tier, and retrieve their recent support tickets. Instead of three parallel async calls, you have three sequential blocking calls: 300 ms + 280 ms + 310 ms = 890 ms, where running them concurrently would take about 310 ms, the duration of the slowest call. The fix is the asyncio.gather or Promise.all call you should have made. In a multi-step agent DAG, every synchronous bottleneck is a latency tax.

4. Defects: Malformed Outputs and Retry Loops

An agent asks for a JSON tool call. The model returns Markdown-wrapped JSON with an extra trailing comma. Your parser throws. The orchestrator retries. The model hallucinates a different schema on the retry. You’re now three LLM calls deep on a task that should have been one. Defects in inference are uniquely expensive because retries aren’t cheap reruns; they’re full-price LLM calls on an already-failed path. Structured outputs (OpenAI’s response_format, Anthropic’s tool use schemas, the instructor library for Python) remove most of this failure class. Where the provider supports schema-constrained decoding, the output is constrained at the token level; otherwise, as with instructor, the response is validated against a schema and re-requested only when validation fails.

5. Over-Processing: Unnecessary Chain-of-Thought

Chain-of-thought (CoT) is a forcing function for reasoning. It is not a default that belongs in every prompt. A routing classifier does not need to explain its reasoning to itself before assigning a ticket category. A field extractor does not need reasoning tokens. Stripping CoT from non-reasoning tasks can cut your output token count by 40–60% on those steps, typically with little or no quality loss.
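
Returning to the first waste, Overproduction: routing on payload size, complexity score, and confidence threshold can be sketched as a small pure function. Everything named here is an assumption for illustration, not something from the article or a specific library: the tier labels, the thresholds, and the idea that an upstream scorer supplies `complexity` and `confidence` values between 0 and 1.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to whatever models you actually deploy.
SMALL, MID, FRONTIER = "small", "mid", "frontier"


@dataclass(frozen=True)
class Task:
    prompt_tokens: int   # payload size
    complexity: float    # 0.0 (trivial) to 1.0 (hard), from an assumed upstream scorer
    confidence: float    # 0.0 to 1.0, how sure that scorer is about its estimate


def pick_tier(task: Task,
              max_small_tokens: int = 2_000,
              small_complexity: float = 0.3,
              frontier_complexity: float = 0.7,
              min_confidence: float = 0.8) -> str:
    """Choose the cheapest tier the task plausibly fits (illustrative thresholds)."""
    # Low confidence in the estimate: escalate rather than risk a defect and a retry.
    if task.confidence < min_confidence:
        return FRONTIER
    if task.complexity >= frontier_complexity:
        return FRONTIER
    if task.complexity <= small_complexity and task.prompt_tokens <= max_small_tokens:
        return SMALL
    return MID
```

The thresholds would need tuning against real traffic; the point of the sketch is only that the default path is the cheap tier and escalation has to be earned.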
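
For the second waste, Inventory, the "re-rank, then truncate aggressively" step might look like the sketch below. It assumes each chunk already carries a relevance score from a re-ranker (the cross-encoder call itself is not shown), and it approximates token counts by splitting on whitespace; a real pipeline would use the model's tokenizer. The function name and default limits are hypothetical.

```python
from typing import List, Tuple


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def select_chunks(scored: List[Tuple[float, str]],
                  max_chunks: int = 3,
                  token_budget: int = 50,
                  min_score: float = 0.5) -> List[str]:
    """Keep the best-scoring chunks that fit the budget; drop the rest."""
    kept: List[str] = []
    used = 0
    # Highest re-ranker score first.
    for score, chunk in sorted(scored, key=lambda pair: pair[0], reverse=True):
        if len(kept) >= max_chunks or score < min_score:
            break  # everything after this point is lower-scored still
        cost = approx_tokens(chunk)
        if used + cost > token_budget:
            continue  # too large for the remaining budget; try smaller chunks
        kept.append(chunk)
        used += cost
    return kept
```

Only what survives the score floor, the chunk cap, and the token budget reaches the prompt, which is the "controlled inventory" idea in code form.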
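
For the third waste, Waiting, here is a minimal sketch of the three lookups issued concurrently with `asyncio.gather`. The `fetch` coroutine is a hypothetical stand-in that just sleeps for the latencies quoted in the example; real code would await an HTTP client or database driver instead.

```python
import asyncio
import time


async def fetch(name: str, delay_s: float) -> str:
    # Stand-in for a real I/O-bound tool call (HTTP request, database query, ...).
    await asyncio.sleep(delay_s)
    return f"{name}: ok"


async def load_user_context() -> list:
    # The three lookups are independent, so start them together and await them
    # as a group. gather() returns results in the order the awaitables were passed.
    return await asyncio.gather(
        fetch("account_history", 0.30),
        fetch("subscription_tier", 0.28),
        fetch("support_tickets", 0.31),
    )


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(load_user_context())
    elapsed = time.perf_counter() - start
    print(results)
    # Expect a figure near the slowest call rather than the sum of all three.
    print(f"elapsed: {elapsed:.2f}s")
```

This only helps when the calls do not depend on each other's results; dependent steps still have to run in sequence.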
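
For the fourth waste, Defects: where schema-constrained output is not available, a cheap local check can at least unwrap Markdown-fenced JSON and reject wrong-shaped replies before the orchestrator pays for a retry. This is a stdlib-only sketch, not the API of any provider or of the instructor library; the required field names `tool` and `arguments` are hypothetical, and it does not attempt to repair genuinely invalid JSON such as a trailing comma.

```python
import json
import re
from typing import Optional

# Built from single backticks so this example can itself live inside a code fence.
_TICKS = "`" * 3
_FENCE = re.compile(
    r"^\s*" + _TICKS + r"[\w-]*\s*\n(.*?)\n?\s*" + _TICKS + r"\s*$",
    re.DOTALL,
)


def parse_tool_call(raw: str, required: tuple = ("tool", "arguments")) -> Optional[dict]:
    """Return the parsed tool call, or None if it is unusable and needs a retry."""
    match = _FENCE.match(raw)
    text = match.group(1) if match else raw  # unwrap a Markdown fence if present
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # genuinely malformed, e.g. a trailing comma
    if not isinstance(data, dict) or any(key not in data for key in required):
        return None  # valid JSON but the wrong shape
    return data
```

A `None` result is the signal to retry or escalate; anything else can go straight to the tool dispatcher without another model call.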

Core Principles of Lean Inference

Just-In-Time Context: The Pull System

In Lean manufacturing, a pull system means downstream demand triggers upstream...
