On-device AI is a margin decision - Ziraph blog
That single choice ripples into your gross margin, your privacy posture, your app-store reviews, and how much of your senior engineers’ time disappears into hardware archaeology, which is why it deserves more than a line in a planning doc.
The margin case is the loudest of those, so I start there, but it is only one of several aspects worth weighing before you commit a roadmap to on-device inference, or decide to rule it out.
1. The unit economics
Cloud inference is a cost of goods that scales with every user and every token and never stops, which makes it the rare software line item that gets worse precisely as you succeed. On-device inference inverts that: the marginal cost of a token drops to roughly zero, paid for by silicon the user already bought. For any product where the AI is central to what it does rather than a small add-on, moving inference onto the device becomes one of the largest gross-margin levers on the table.
It is also the lever most often assumed to work rather than checked, and that gap is the thread running through every other aspect below.
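One way to check the lever rather than assume it is a back-of-envelope calculation. The sketch below is only that: the function names and every number in it (subscription price, tokens per user, cloud price per million tokens, the share of tokens served on-device) are hypothetical placeholders, and it treats the marginal cost of an on-device token as zero, which is the simplification described above rather than a measured figure.

```python
def cloud_inference_cost(cloud_tokens: float, cost_per_million_tokens: float) -> float:
    """Monthly cloud inference cost for one user, in the same currency as the price."""
    return cloud_tokens / 1_000_000 * cost_per_million_tokens


def gross_margin(
    price_per_user: float,
    tokens_per_user: float,
    cost_per_million_tokens: float,
    on_device_share: float = 0.0,
    other_cogs_per_user: float = 0.0,
) -> float:
    """Gross margin as a fraction of revenue for one user per month.

    Assumption: tokens served on-device cost nothing at the margin,
    so only the remaining cloud tokens count as cost of goods.
    """
    if price_per_user <= 0:
        raise ValueError("price_per_user must be positive")
    if not 0.0 <= on_device_share <= 1.0:
        raise ValueError("on_device_share must be between 0 and 1")
    cloud_tokens = tokens_per_user * (1.0 - on_device_share)
    cogs = cloud_inference_cost(cloud_tokens, cost_per_million_tokens) + other_cogs_per_user
    return (price_per_user - cogs) / price_per_user


if __name__ == "__main__":
    # Hypothetical inputs: replace with your own measured usage and pricing.
    price = 10.00            # monthly subscription per user
    tokens = 2_000_000       # tokens generated per user per month
    cloud_price = 1.50       # cloud cost per million tokens

    all_cloud = gross_margin(price, tokens, cloud_price)
    mostly_local = gross_margin(price, tokens, cloud_price, on_device_share=0.8)
    print(f"all cloud:     {all_cloud:.0%}")
    print(f"80% on-device: {mostly_local:.0%}")
```

The useful part is less the output than the inputs it forces you to write down: if you do not know your tokens per user or the share that could realistically run locally, the margin case is still an assumption.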
2. Privacy and compliance
For anything that touches sensitive data, whether that is health, finance, legal work, or a user’s personal context, the line “it never leaves the device” is both a feature you can charge for and a compliance story that shortens enterprise sales cycles, and on-device inference is how you earn the right to say it.
The catch is that “local” has to actually be local. WWDC 2026 was a useful reminder that even Apple’s “on-device” Foundation Models can route to the cloud, with the flagship tier running on NVIDIA GPUs in Google’s cloud and no public signal of when that switch flips. A privacy claim is therefore only as strong as your ability to verify that the work really stayed on the user’s silicon.
3. The battery is your real budget
The failure mode of on-device AI rarely arrives as a server bill; it arrives as the user’s battery draining, because a feature that pins the GPU and heats the phone gives you no billing alert at all, only one-star reviews and uninstalls.
On a battery-powered device, energy per token and thermal headroom stop being backend numbers and become product-quality metrics. The trade-offs are usually not the ones you would guess: a model that runs ten percent faster while drawing twice the energy per token is often the worse choice for a phone. Throughput alone will never surface that, which is why you want it measured before the feature ships rather than after the support tickets start arriving.
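To make the ten-percent-faster, twice-the-energy comparison concrete, here is a minimal sketch of the arithmetic. It assumes you already have an average power draw and a throughput figure for each model from some measurement tool. The function names, the wattages, the token counts, and the battery capacity are all hypothetical and chosen only so the second model matches the example in the text.

```python
def energy_per_token_joules(avg_power_watts: float, tokens_per_second: float) -> float:
    """Energy spent per generated token: watts are joules per second."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return avg_power_watts / tokens_per_second


def battery_share(tokens: float, joules_per_token: float, battery_watt_hours: float) -> float:
    """Fraction of a full battery consumed by generating `tokens` tokens."""
    battery_joules = battery_watt_hours * 3600.0  # 1 Wh = 3600 J
    return tokens * joules_per_token / battery_joules


if __name__ == "__main__":
    # Hypothetical measurements, not real device data.
    baseline = energy_per_token_joules(avg_power_watts=4.0, tokens_per_second=20.0)
    faster = energy_per_token_joules(avg_power_watts=8.8, tokens_per_second=22.0)

    daily_tokens = 20_000   # hypothetical heavy user
    battery_wh = 12.0       # hypothetical phone battery capacity

    for name, jpt in (("baseline", baseline), ("10% faster", faster)):
        share = battery_share(daily_tokens, jpt, battery_wh)
        print(f"{name}: {jpt:.2f} J/token, {share:.1%} of battery per day")
```

With these placeholder numbers the faster model finishes each response slightly sooner and costs roughly double the battery for the same work, which is the trade-off a throughput chart hides.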
4. The stack is fragmenting, so choose deliberately
A year ago, “on-device on Apple” mostly meant Core ML, whereas today Core ML, Core AI, MLX, Ollama, and llama.cpp all coexist with overlapping remits and no settled answer about which one a given team should reach for.
Your engineers will pick one or several. The part that matters for you is that the first-party tooling is uneven: Apple’s own profiler is timing-only, bound to Xcode, and can see only your own app. It cannot profile Ollama or MLX or an arbitrary process, and it reports no energy, no memory bandwidth, and no thermal state at all.
Whichever stack you land on, you will want visibility that spans it rather than tooling locked to a single framework.
5. Your users are not on your dev machine
A model that flies on the M-series in your laptop can throttle, or quietly fall back to a slower accelerator, once it is running on the median device in your install base rather than on the machine you build on.
Even which on-device model you get is hardware-gated rather than selectable, because Apple ships a smaller dense model to most chips and reserves the larger one for its most capable silicon. The decision therefore has to be made against the hardware your users actually carry, at the quantization and the context length you actually ship, and not against the best-case benchmark you happened to run on a top-end Mac.
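A rough way to turn “the hardware your users actually carry” into a number is to weight each device class by its share of your install base and read off the median, or a lower percentile, instead of the best case. The sketch below assumes you have per-device throughput measured at your shipping quantization and context length. The device names, shares, and tokens-per-second figures are invented for illustration.

```python
from typing import List, Tuple

# (device class, share of install base, measured tokens per second)
Device = Tuple[str, float, float]


def install_base_percentile(devices: List[Device], percentile: float) -> Tuple[str, float]:
    """Return the device class and throughput at a share-weighted percentile.

    Devices are sorted from slowest to fastest; the result is the first one
    at which the cumulative install-base share reaches `percentile`.
    """
    if not devices:
        raise ValueError("devices must not be empty")
    if not 0.0 < percentile <= 1.0:
        raise ValueError("percentile must be in (0, 1]")
    total_share = sum(share for _, share, _ in devices)
    ordered = sorted(devices, key=lambda d: d[2])
    cumulative = 0.0
    for name, share, tokens_per_second in ordered:
        cumulative += share / total_share
        if cumulative >= percentile:
            return name, tokens_per_second
    # Floating-point rounding can leave the sum just under 1.0.
    return ordered[-1][0], ordered[-1][2]


if __name__ == "__main__":
    # Hypothetical install base, not real benchmark data.
    fleet = [
        ("older phone", 0.35, 6.0),
        ("mid-range phone", 0.40, 14.0),
        ("recent phone", 0.20, 25.0),
        ("dev-machine class", 0.05, 60.0),
    ]
    print("median user:", install_base_percentile(fleet, 0.50))
    print("slowest quartile:", install_base_percentile(fleet, 0.25))
```

In this made-up fleet, the machine the feature was built on represents the fastest five percent of users, and the median user sees well under a quarter of its throughput.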
6. The cost of not knowing is senior-engineer time
“Is it on the Neural Engine or falling back? Why did it slow down at long context?” is, today, multi-day hardware archaeology, which means your most expensive engineers spend days on detective work that the tooling ought to answer in a single command.
There is a second cost hiding inside the first one. The numbers your team reports upward, to you, to a board, or to a customer, should be measured and defensible rather than a vendor TOPS spec or a hopeful estimate. A claim about your on-device performance that cannot survive a skeptical engineer has no business being in the deck in the first place.
The phrase it all hinges on: on the...