Tokenminning — The Manifesto
§ 0A Reckoning Is Coming
Somewhere between the demo and the deployment, a quiet catastrophe is forming.<br>It does not announce itself. It accumulates in API invoices, in GPU reservation queues, in data center power contracts that utilities are struggling to honor, in the growing gap between what organizations budgeted for AI and what they are actually paying. It lives in the architecture decisions made during prototyping — decisions built around convenience, speed-to-demo, and a default assumption that has quietly become one of the most expensive beliefs in enterprise technology: more tokens means better results.<br>That assumption is the disease. This manifesto names it, diagnoses it, and proposes the cure.<br>Global inference costs as a share of enterprise cloud spend are growing at rates no analyst modeled in 2022. The organizations that survive and thrive in the next phase of AI adoption will not be the ones that threw the most tokens at their problems. They will be the ones that learned to spend tokens with intention — to compress without losing signal, to route with precision, to build systems that accomplish their goals with the minimum viable inference footprint.<br>We call this discipline tokenminning .<br>It stands in direct contrast to tokenmaxxing — the implicit, often unconsidered practice of maximizing token consumption across prompts, context windows, reasoning chains, and agentic loops, in the mistaken belief that more is always better.<br>Tokenminning is not frugality. It is not constraint. It is not a compromise imposed from outside by a procurement team alarmed by the AI bill.<br>It is engineering excellence, applied to inference.<br>This is the founding document of the tokenminning movement. If you have been quietly trimming prompts, questioning default verbosity, pushing back on the instinct to "just throw everything in the context window," and wondering why no one else in the room seems to care about the inference bill — you have already been practicing tokenminning. This is your vocabulary. This is your community.
§ 1<br>§ 1The Tokenmaxxing Problem
1.1 What Tokenmaxxing Is<br>Tokenmaxxing is the practice — rarely explicit, almost always present — of maximizing token throughput at every layer of an LLM-powered system. It manifests as system prompts that run thousands of words long "just to be safe." It manifests as context windows stuffed with entire codebases, entire document repositories, entire conversation histories. It manifests as chain-of-thought instructions that demand elaborate step-by-step reasoning for tasks that do not require it. It manifests as agentic loops that append every tool call result in full to an ever-growing context rather than distilling observations into compact state.<br>Tokenmaxxing did not arrive maliciously. It emerged from a confluence of reasonable-seeming observations made early in the LLM era. Long chain-of-thought reasoning demonstrably improved performance on multi-step mathematical and logical tasks. Providing more examples in few-shot prompts seemed to improve reliability. Larger context windows were marketed as features — capabilities to be used, not ceilings to be respected. LLM orchestration frameworks lowered the friction of building complex pipelines to near zero, but built in defaults that maximize context retention rather than minimize it.<br>The result was a cargo cult: practitioners observed that more tokens sometimes helped, and concluded that more tokens always help. This is not an irrational inference from limited data. It is simply wrong.<br>1.2 The Hidden Costs Nobody Audits<br>The costs of tokenmaxxing are real, multidimensional, and almost universally untracked.<br>Latency cost. Inference time scales roughly linearly with output token count and super-linearly with prefill size on most transformer architectures due to attention computation over the full context. A prompt that is twice as long is not merely twice as expensive — it is often meaningfully slower at the wall-clock level, particularly as context lengths grow. Users experience this as sluggishness. Engineers experience it as SLA violations. The token is not an abstract unit; it is time.<br>Financial cost. Inference is billed by the token, and the bill is proportional to consumption. A system prompt that has grown from 400 to 4,000 tokens over successive iterations of "just add one more instruction" has increased per-call input cost by an order of magnitude. Multiplied across millions of daily API calls, this is not a rounding error. It is a budget line item that most finance teams cannot yet see because AI cost attribution at the feature level remains rare.<br>Energy cost. Every token processed is compute executed, and compute consumes power. The arithmetic is not abstract. Each forward pass through a large transformer model performs billions of floating-point operations. Those operations require energy. At scale — across an organization, across an industry — the aggregate energy demand...