Why I care so much about energy per token - Ziraph blog
The trend line is clear: AI is moving on-device, and as it does, energy per token becomes the currency that decides the real choices - which model, which quant, which runtime, on which chip.
Every local AI tool I run prints one number proudly, in big friendly digits at the end of a run: tokens per second. Not one of them prints the bill.
That bothered me for a long time before I did anything about it. tok/s tells you how fast the answer arrived. It tells you nothing about what the answer cost - how many joules your laptop spent, how hot the die got, how much of your battery you just traded for those tokens. On a datacenter GPU you can wave that away; someone else pays the power bill and the room is air-conditioned. On an M-series MacBook running on battery, you pay it, in watt-hours and in fan noise and in thermal throttling that quietly halves your throughput two iterations in. The number that matters there is not how fast - it is how much. So I built Ziraph around measuring joules, not just counting tokens. This is why.
tok/s is vanity. Energy per token is sanity. Joules are reality.
Speed and cost are different axes, and the gap between them is where the interesting decisions live. A runtime can win on tokens per second and lose on joules per token - and if tok/s is all you logged, you would never know. You would ship the faster one and quietly pay more for every token it produces, forever.
Take my gemma-4 field report - matched quant, the same weights, llama.cpp against MLX on one M1. By tokens per second the two were a dead tie (reported decode 7.20 against 7.16 tok/s, both pinned at the memory-bandwidth wall), and GPU energy per token came out a wash. A speed-only reading says "identical, flip a coin." The CPU-energy bill said otherwise: MLX burned 2 to 4.5 times the CPU energy of llama.cpp to produce the same answer. That is a figure nobody else prints, and it is exactly the kind of difference you ship blind when tok/s is all you logged.
That is the case where decode ties and only the energy tells the two apart. The sharper case is the one tok/s hides entirely: the faster runtime that is also the hungrier one. You cannot see either without measuring the bill. So I measure the bill.
Why energy per token is the right unit
Not energy. Energy per token.
Raw watts tell you the chip is busy; they do not tell you whether it is busy doing useful work. Total joules for a run conflate a long answer with an expensive one. Energy per token is the unit that normalises all of that away - across model size, across quantization, across which chip you happen to be on. It is the cost of one unit of output, which is the thing you actually care about when you are choosing a model, a quant, or a runtime.
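To make the unit concrete, here is a minimal sketch of how energy per token falls out of a power trace. It assumes you already have timestamped power samples in watts for the span of one generation; the function names and the sample numbers are hypothetical illustrations, not Ziraph's API and not measurements.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoid rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    if len(timestamps_s) < 2:
        raise ValueError("need at least two samples to integrate")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of one trapezoid: average power across the interval times its length.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_per_token(total_joules: float, tokens: int) -> float:
    """Normalise a run's energy by the number of tokens it produced."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return total_joules / tokens


# Made-up trace: a steady 10 W sampled once a second for 4 seconds.
timestamps = [0.0, 1.0, 2.0, 3.0, 4.0]
watts = [10.0, 10.0, 10.0, 10.0, 10.0]

run_joules = energy_joules(timestamps, watts)  # 40.0 J for the whole run
per_token = joules_per_token(run_joules, 20)   # 2.0 J/token if 20 tokens came out
```

A run twice as long at the same power and the same decode rate doubles the total joules but leaves joules per token unchanged, which is the normalisation the paragraph above describes.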
And it converts cleanly into the things that bite you in practice. Energy per token times tokens is battery drain. Energy per token times tokens per second is power draw, and power draw against the thermal envelope decides whether you sustain throughput or throttle. Energy per token times the price of electricity is dollars per token when you do eventually run at scale. It is the one number that ties the micro-decision (which quant?) to the macro-consequence (how long does the laptop last, how hot does the phone get, what does inference cost). That is why it is the metric I built the whole tool to surface.
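The conversions are plain unit arithmetic, sketched below under stated assumptions. The helper names, the battery capacity, the decode rate, and the electricity price are all hypothetical inputs chosen for illustration. The dollar figure covers electricity only, not hardware or cooling.

```python
JOULES_PER_WH = 3600.0         # 1 watt-hour = 3600 joules
JOULES_PER_KWH = 3_600_000.0   # 1 kilowatt-hour = 3.6 million joules


def battery_fraction_used(j_per_token: float, tokens: int, battery_wh: float) -> float:
    """Fraction of a battery of the given capacity spent producing `tokens` tokens."""
    return (j_per_token * tokens) / JOULES_PER_WH / battery_wh


def average_power_w(j_per_token: float, tokens_per_s: float) -> float:
    """Sustained power draw: (J/token) * (token/s) = J/s = W."""
    return j_per_token * tokens_per_s


def dollars_per_million_tokens(j_per_token: float, price_per_kwh: float) -> float:
    """Electricity cost of one million tokens at the given price per kWh."""
    return (j_per_token * 1_000_000) / JOULES_PER_KWH * price_per_kwh


# Illustrative inputs only: 2 J/token, a 50 Wh battery, 7 tok/s decode, $0.20 per kWh.
drain = battery_fraction_used(2.0, 10_000, 50.0)  # share of the battery for 10k tokens
power = average_power_w(2.0, 7.0)                 # watts the chassis has to shed
cost = dollars_per_million_tokens(2.0, 0.20)      # dollars of electricity per 1M tokens

print(f"battery used: {drain:.1%}, sustained draw: {power:.1f} W, cost: ${cost:.3f}/M tokens")
```

The middle helper is the link to throttling: at a fixed decode rate, a runtime with a lower energy per token draws less sustained power, so it sits further from the thermal limit.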
The personal why
The honest origin is smaller than the thesis. I kept running models locally, and I kept wanting to answer a simple question - which of these is cheaper to run? - and I could not. Every tool showed me speed. None showed me cost. The data was right there in the silicon's own counters; nothing surfaced it for a single workload. That gap - the chip knows, and won't tell you - is the entire reason Ziraph exists. I did not set out to build an energy profiler. I set out to stop guessing about the bill, and the profiler is what that turned into.
It is not just my hunch - the field is converging on this
Energy per token has been quietly becoming the serious unit for measuring inference. The ml.energy project and the Zeus measurement framework out of the academic ML-systems world have spent years making energy a first-class, measured quantity for model inference rather than an afterthought - energy per token, leaderboards, the works. MLPerf added a Power category for the same reason: throughput numbers without a power number are half a result.
On Apple Silicon specifically, the published work backs the method. Tummalapalli et al. (arXiv 2603.23640) measured an iPhone 16 Pro halving its throughput within two iterations under sustained load; Orion (arXiv 2603.06728), reverse-engineering the Neural Engine on an M4 Max, confirms that LLM inference frameworks universally target the GPU, not the...