Don't Build a Router. Train the Small Model to Know When to Defer. — distil labs
How It Works Customers Blog Pricing Docs Contact<br>Start with CLI
Join
← All content Guide Tool CallingAgentic AI<br>Don't Build a Router. Train the Small Model to Know When to Defer.<br>June 17, 2026 View on GitHub<br>You’re building a customer-support assistant. A frontier LLM handles every conversation well, but you’re paying frontier prices for “look up my reservation” and “what’s my baggage allowance,” which are the overwhelming majority of turns. The hard turns (refund eligibility under fare rules, compensation math across passengers, multi-constraint rebooking) are a small minority, but they’re the ones where a small model quietly gets it wrong. Pick either model alone and you either bleed money on trivial turns or risk silent errors on the turns that matter most.
A two-tier cascade gets the best of both. The fine-tuned SLM resolves the easy majority at a fraction of the cost, and the frontier model handles the hard minority where it actually earns its price. Each model does only what it’s good at, so the pair matches the large model’s quality at a fraction of its cost, without the silent errors you’d get from the small model alone. 2 + 2 > 4.
What makes this practical is that there is no elaborate routing system to build. No separate classifier, no confidence thresholds to tune, no second model deciding who handles what. The SLM is trained to recognize when it’s out of its depth and emit a single defer_to_larger_model tool call, and the orchestrator just honors it.
We back this up with a working demo: a flexible airline customer-support bot where a fine-tuned Qwen3-1.7B handles the bulk of turns and escalates the hard ones to a larger model. If you have a support workflow in mind, get in touch and we’ll show you what a deferral SLM can do for your domain.
Two Bad Options
Run every turn on a frontier model and you get the accuracy, but you pay frontier prices for traffic that is mostly trivial lookups. Run every turn on a small fine-tuned model and the economics flip in your favor, but the hard tail is exactly where a small model returns a confident, wrong answer. The gap between the two is wide. For a typical support turn (policy plus tool schemas plus dialogue history plus the user message, roughly 800 input and 100 output tokens):
ApproachCost / 1M turnsLatency / turnFrontier model, every turn~$3,000500-1,200 msSmall model (cloud), every turn~$600100-300 ms<br>Frontier cost from GPT-4o list pricing of $2.50 / $10 per 1M input/output tokens, applied to the per-turn payload above. The small model runs at roughly a fifth of that on small-model token rates, and latency reflects each model served behind a network call.
On paper you have to choose: pay for accuracy you don’t need on every turn, or save money and accept that the hard turns will sometimes come back wrong. Most teams pick one and live with the downside.
The Best of Both Worlds
You don’t have to choose. A two-tier cascade sends the easy majority of turns to the small model, which answers at roughly a fifth of the cost and a few hundred milliseconds, and routes only the genuinely-hard minority to the frontier model, where the accuracy is worth its price. You get the frontier model’s answer on the turns that need it and the small model’s economics and speed on the turns that don’t, in one system.
Because the frontier model is billed only on the deferred minority, the blended cost stays far below running everything on the frontier, while conversation quality holds because the hard turns are exactly the ones that get escalated. In our demo the small model handles ~96% of turns and escalates only the hardest ~4%, so the blended bill is dominated by the cheap tier (≈$700 per 1M turns at the per-turn rates above, against ~$3,000 for all-frontier) while quality stays statistically indistinguishable from running everything on the frontier model (see the Results below). Cheaper than all-frontier, more reliable than all-small. 2 + 2 > 4.
The obvious worry is that a cascade adds moving parts: another model to route traffic, thresholds to tune, a system to keep in sync. It doesn’t. The routing decision lives inside the small model itself.
No Router. The Model Just Knows When to Defer.
The usual way to add a cascade is to bolt on a router: a difficulty classifier, a confidence threshold off the model’s logprobs, or a second model deciding who handles what. That is all extra infrastructure to tune and keep in sync with your policy, and confidence heuristics are badly calibrated to begin with.
We skip it. The deferral is trained into the small model: during distillation the teacher marks the genuinely-hard turns and the student learns to recognize them. At runtime the SLM emits a single defer_to_larger_model tool call, exactly like any other tool, and the orchestrator hands off the rest of the conversation. The decision uses the model’s full view of the conversation...