Five multi-model patterns that cut token costs

Five multi-model patterns that cut token costs — and keep your data where you want it - Berget AI


Jun 10, 2026

Marcus Olsson, Developer Relations


Frontier API pricing has climbed sharply over the past year, and usage has climbed faster. If you've watched a single agentic session burn through a day's token budget, you're not alone. Multi-step sessions, repo-wide iteration, and long context windows consume tokens at rates flat pricing never anticipated. A couple of commits on a complex codebase can eat half a month's budget.

The obvious response is to switch to a cheaper model. But a single cheaper model recreates the same problem one tier down: you're still paying one rate for everything, hard reasoning and autocomplete alike. Some tasks genuinely need the strongest frontier model. A sub-agent exploring your codebase doesn't.

In this post, you'll learn why you should stop using one model for every task, and why every routing decision is also a data-boundary decision. The same architecture that cuts your token bill determines whether your data ever leaves the device, and whose jurisdiction it lands in when it does.

We'll work through five architectural patterns, distinguished by what triggers the escalation to a stronger model:

Feature routing — assign model tiers by feature, upfront

Cascade — try the cheap model first, escalate on failure

Advisor — a cheap executor consults a stronger planner

Specialist — a generalist hands off to a model with a capability it lacks

Draft-and-verify — generate on-device, verify remotely
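To make "what triggers the escalation" concrete, here is a minimal sketch of the cascade pattern, where the trigger is a failed acceptance check on the cheap model's answer. The `cascade` function, the stand-in models, and the `non_empty` check are all hypothetical names for illustration; a real setup would call actual model endpoints and use a more meaningful check (tests passing, a schema validating, a confidence score).

```python
from typing import Callable

# Hypothetical model interface: takes a prompt, returns text.
Model = Callable[[str], str]


def cascade(prompt: str, cheap: Model, strong: Model,
            accept: Callable[[str], bool]) -> tuple[str, str]:
    """Try the cheap model first; escalate only if its answer is rejected.

    Returns (tier_used, answer).
    """
    draft = cheap(prompt)
    if accept(draft):
        return "cheap", draft
    # Escalation trigger: the cheap answer failed the acceptance check.
    return "strong", strong(prompt)


# Stand-in models for illustration only (assumptions, not real APIs).
def cheap_model(prompt: str) -> str:
    # Pretend the cheap model gives up on prompts marked as hard.
    return "" if "hard" in prompt else f"cheap answer to: {prompt}"


def strong_model(prompt: str) -> str:
    return f"strong answer to: {prompt}"


def non_empty(answer: str) -> bool:
    return bool(answer.strip())


easy = cascade("rename a variable", cheap_model, strong_model, non_empty)
tough = cascade("hard refactor", cheap_model, strong_model, non_empty)
```

The other four patterns differ mainly in where that trigger sits: in configuration, in the executor's own judgment, in a capability gap, or in a remote verification step.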

Three dimensions of every routing decision

While cost is on the minds of many right now, it's not the only aspect to consider when moving to a multi-model stack. Cost, capability, and data sovereignty all decide where a task should run, and each may pull you in a different direction.

Cost. Calling a hosted Gemma 4 costs less than €0.50 per million tokens, while a strong reasoner from one of the big AI labs can run €15–50 per million output tokens. Even within a single provider, the largest model can be 10 times the price of the smallest. For high-volume agent work, routing each task to the cheapest model that can handle it can drastically reduce your token costs.
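A rough way to see the effect is to compute a blended price for a given routing split. The sketch below uses the illustrative figures above (€0.5 as the ceiling for the cheap model, €15 as the low end for the strong one) and assumes, purely for the example, that 10% of tokens still go to the strong model. It also ignores the difference between input and output pricing.

```python
def blended_cost_per_million(cheap_price: float, strong_price: float,
                             strong_share: float) -> float:
    """Average price per million tokens when `strong_share` of the tokens
    go to the strong model and the rest go to the cheap one."""
    if not 0.0 <= strong_share <= 1.0:
        raise ValueError("strong_share must be between 0 and 1")
    return (1.0 - strong_share) * cheap_price + strong_share * strong_price


# Prices in EUR per million tokens; the 10% share is an assumption.
all_strong = blended_cost_per_million(0.5, 15.0, strong_share=1.0)
routed = blended_cost_per_million(0.5, 15.0, strong_share=0.1)

print(f"everything on the strong model: EUR {all_strong:.2f} per million")
print(f"10% escalated:                  EUR {routed:.2f} per million")
```

Under these assumptions the blended price lands at roughly €1.95 per million tokens instead of €15. The real saving depends on your own escalation rate, which is worth measuring before committing to a design.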

Capability. Some tasks still require the strongest available reasoning. But as smaller models become increasingly capable, the line is no longer clear-cut. Default to cheap models and escalate only when the evidence says you must.

Data sovereignty. Where a task runs determines which rules apply. GDPR restricts transfers of personal data outside the EU/EEA, and the EU AI Act adds obligations on top. On-device inference and EU-hosted sovereign inference both settle the jurisdiction question by design, while US-hosted APIs require added legal attention. This is the second thread we'll track through every pattern, alongside cost.
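One way to treat the data boundary as a routing input is to filter endpoints by location before comparing prices. The following is a sketch only: the `Location` categories, the endpoint names, the prices, and the policy in `allowed_locations` are all assumptions for illustration, and a real policy would come from your own compliance requirements.

```python
from dataclasses import dataclass
from enum import Enum


class Location(Enum):
    ON_DEVICE = "on-device"
    EU_HOSTED = "eu-hosted"
    US_HOSTED = "us-hosted"


@dataclass(frozen=True)
class Endpoint:
    name: str                  # hypothetical endpoint label
    location: Location
    price_per_million: float   # EUR, illustrative


def allowed_locations(contains_personal_data: bool) -> set[Location]:
    """Assumed policy: personal data stays on-device or EU-hosted."""
    if contains_personal_data:
        return {Location.ON_DEVICE, Location.EU_HOSTED}
    return set(Location)


def cheapest_allowed(endpoints: list[Endpoint],
                     contains_personal_data: bool) -> Endpoint:
    """Apply the data boundary first, then pick the cheapest survivor."""
    allowed = allowed_locations(contains_personal_data)
    candidates = [e for e in endpoints if e.location in allowed]
    if not candidates:
        raise LookupError("no endpoint satisfies the data boundary")
    return min(candidates, key=lambda e: e.price_per_million)


endpoints = [
    Endpoint("us-small", Location.US_HOSTED, 0.3),
    Endpoint("eu-small", Location.EU_HOSTED, 0.5),
]
```

With these made-up numbers, a task without personal data goes to `us-small` on price, while the same task with personal data goes to `eu-small`. The boundary is applied before cost, not traded off against it.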

Note that running models on-prem isn't necessarily cheaper. A local 70B model on a rented GPU can cost more than API calls at low volume; the savings appear once volume is high enough that fixed hardware cost beats marginal token cost. And none of the patterns below requires local inference — routing between a cheap and a strong hosted model is still a multi-model stack, and at low volume it's often the right one.
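The break-even point can be estimated with simple arithmetic: divide the fixed monthly hardware cost by the per-token saving. In the sketch below, the €1,000 monthly GPU rental is a hypothetical figure chosen only for illustration, and the local marginal cost per token (power, operations) defaults to zero, which flatters the local option.

```python
def break_even_million_tokens(fixed_monthly_cost: float,
                              api_price_per_million: float,
                              local_marginal_per_million: float = 0.0) -> float:
    """Monthly volume (in millions of tokens) above which fixed hardware
    cost beats paying per token through an API."""
    saving = api_price_per_million - local_marginal_per_million
    if saving <= 0:
        raise ValueError("local inference never breaks even at these prices")
    return fixed_monthly_cost / saving


# Hypothetical EUR 1,000 per month GPU rental against two API price points.
vs_cheap_api = break_even_million_tokens(1000.0, 0.5)
vs_strong_api = break_even_million_tokens(1000.0, 15.0)

print(f"vs EUR 0.5 per million: {vs_cheap_api:,.0f} million tokens per month")
print(f"vs EUR 15 per million:  {vs_strong_api:,.1f} million tokens per month")
```

Against a cheap hosted model, the hypothetical GPU only pays off at around 2,000 million tokens per month. Against a €15 model the threshold is far lower, but that comparison only holds if the local model can actually do the strong model's work.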

Where to run your models

The question is no longer "which model is best?" It's "what does the task need?" — and raw capability is only one of those needs. A tight latency budget, or a requirement that data never leaves the device, narrows the hardware choice just as firmly as reasoning difficulty does. Think of it as a resource spectrum rather than a hierarchy of developer machines.

Start with the constraints of where the model runs:

Edge and embedded devices — Tiny models (≤4B parameters, quantised) on phones, gateways, single-board computers, and industrial controllers. Well-defined tasks: classification, intent detection, summarisation, and autocompletion. No network round-trip, and no data leaving the device — often the whole point in IoT deployments, where connectivity is intermittent or the data is sensitive by default.

Consumer hardware with a dedicated GPU — Small models (4–40B, quantised) for conversations and straightforward coding tasks.

Workstation or on-prem server — Medium models (40–150B). Large models only for tasks that genuinely need them.

Large models (>150B) put you in multi-GPU territory: expensive, supply-constrained, and rarely worth owning. That's where hosted inference wins on economics, not just convenience.
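The size bands above can be written down as a small lookup. This is a sketch under stated assumptions: it looks at parameter count only, ignores quantisation level, latency budgets, and data constraints, and resolves the overlapping boundaries (4B and 40B appear in two bands) in favour of the smaller hardware tier. The function name is hypothetical.

```python
def placement_for(params_billion: float) -> str:
    """Map a model's parameter count (in billions) to the hardware tier
    described in the text. Boundary values go to the smaller tier."""
    if params_billion <= 0:
        raise ValueError("parameter count must be positive")
    if params_billion <= 4:
        return "edge or embedded device"
    if params_billion <= 40:
        return "consumer hardware with a dedicated GPU"
    if params_billion <= 150:
        return "workstation or on-prem server"
    return "hosted inference"


for size in (3, 27, 70, 400):
    print(f"{size}B -> {placement_for(size)}")
```

In practice the constraints run the other way too: a requirement that data never leaves the device caps the model size you can consider, whatever the task's difficulty.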

Five patterns for multi-model architectures

So how do you manage and orchestrate workflows across multiple models? As the introduction previewed, what separates the five...
