AI Voice Agent Architecture: Lessons From 3 Production Builds
Post author: Farid Fadaie
Post published: June 10, 2026
Post category: AI
I built the same production voice agent three times. Same requirements, same telephony stack, same speech models available — three fundamentally different architectures. The first one collapsed under its own coupling: every bug fix broke something else. The second one was controllable and correct, and it destroyed the one thing a voice agent exists for: responding like a human, immediately. The third one is the only one that survived contact with real callers.
This post is the architecture write-up I wish I’d read before spending months and real money finding out the hard way. It’s about one question that turns out to decide everything: who owns the next turn?
The problem shape
The agent answers real phone calls. It has to do three things at once, and they pull in different directions:
Hold a natural conversation. Sub-second turn-taking, no dead air, graceful handling of interruptions, callers who ramble, and callers who switch languages mid-sentence.
Capture structured data reliably. A defined set of fields must end up in a database — names spelled right, numbers digit-perfect, nothing invented.
Enforce hard rules. A few behaviors must always happen (reading a phone number back digit by digit) and a few must never happen (hanging up while a required question is unanswered, promising things the business didn’t authorize).
Every architecture below is a different answer to where those three responsibilities live.
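To make the "hard rules" concrete, here is a minimal sketch of two of them as plain checks: the digit-by-digit read-back and the ban on hanging up with a required question unanswered. The field names, `CallState`, and the function names are hypothetical, chosen for illustration; the post does not say how any of the builds expressed these rules.

```python
from dataclasses import dataclass, field

# Hypothetical required fields; the real field set is not given in the post.
REQUIRED_FIELDS = ("caller_name", "callback_number")


def read_back_digits(phone_number: str) -> str:
    """Render a phone number as text to be spoken one digit at a time."""
    digits = [ch for ch in phone_number if ch.isdigit()]
    return " ".join(digits)


@dataclass
class CallState:
    # Field name -> captured value; empty or missing means unanswered.
    captured: dict = field(default_factory=dict)

    def missing_fields(self) -> list:
        return [name for name in REQUIRED_FIELDS if not self.captured.get(name)]

    def may_hang_up(self) -> bool:
        # "Never" rule: no hang-up while a required question is unanswered.
        return not self.missing_fields()
```

The checks themselves are trivial. What differs between the architectures is which component is responsible for running them and acting on the result.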
Architecture 1: the server-side orchestrator
Architecture 1: every caller turn fans out to parallel classifiers; shared flags couple everything to everything.
The first build treated the speech model as a mouthpiece. The server owned everything: each caller utterance was transcribed and fanned out to a battery of parallel classifiers — one deciding the call category, one watching for a provider name, one watching for end-of-call intent, one detecting reference-only questions, plus hand-maintained regex banks for language detection (including phonetic transliterations of "can you speak X to me?" in half a dozen scripts). A merge layer combined their outputs into a "turn analysis," and a policy engine picked what to say next, often from scripted prompts.
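The shape of that pipeline can be sketched roughly as follows. This is an illustrative reconstruction, not the original code: the classifier bodies are toy stand-ins for model calls and regex banks, and every name here (`TurnAnalysis`, `analyze_turn`, `pick_prompt`, the scripted strings) is an assumption.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnAnalysis:
    category: Optional[str]
    provider: Optional[str]
    wants_to_end: bool


# Toy classifiers; each stands in for a model call or a regex bank.
async def classify_category(text: str) -> Optional[str]:
    return "scheduling" if "appointment" in text.lower() else None


async def detect_provider(text: str) -> Optional[str]:
    match = re.search(r"\bdr\.?\s+(\w+)", text, re.IGNORECASE)
    return match.group(1) if match else None


async def detect_end_of_call(text: str) -> bool:
    return bool(re.search(r"\b(bye|goodbye)\b", text, re.IGNORECASE))


async def analyze_turn(text: str) -> TurnAnalysis:
    # Fan out: every classifier sees the same utterance concurrently,
    # then the results are merged into one "turn analysis".
    category, provider, wants_to_end = await asyncio.gather(
        classify_category(text), detect_provider(text), detect_end_of_call(text)
    )
    return TurnAnalysis(category, provider, wants_to_end)


def pick_prompt(analysis: TurnAnalysis) -> str:
    # Policy engine: choose a scripted prompt from the merged analysis.
    if analysis.wants_to_end:
        return "Thanks for calling. Goodbye."
    if analysis.provider:
        return f"Got it, you'd like to see Dr. {analysis.provider}."
    return "How can I help you today?"
```

Calling `asyncio.run(analyze_turn("I need an appointment with Dr. Lee"))` yields an analysis that `pick_prompt` turns into the provider confirmation. Note that the ordering of the `if` branches in `pick_prompt` already encodes a priority between concerns, which is where the coupling starts.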
Why it’s tempting
Every decision is inspectable server code. When something goes wrong, there’s a line number.
Each classifier is small, cheap, and unit-testable in isolation.
Adding a rule looks easy: write another classifier, wire it into the merge.
Scripted prompts give you word-level control over what the agent says.
Why it collapsed
The classifiers were never actually independent. The merge layer and a growing web of shared session flags meant every concern’s guard read every other concern’s state. The provider gate needed to know about the end-of-call state; the end-of-call logic needed to know whether a language switch was pending; the language logic needed to know whether a scripted prompt was mid-playback. Fixing a bug in one guard reliably broke an invariant another guard depended on. This is the "too intertangled" failure: the system had no single owner for any decision, so every decision was made N times by N pieces of code that disagreed at the margins.
Regex banks don’t survive multilingual reality. The language-detection patterns grew into a museum of transliterations and edge cases — unfalsifiable, unreviewable, and wrong in ways you only discover on a live call. Every new language meant re-deriving every pattern. The model I was routing around already understood all of these languages natively.
Latency stacked. Multiple classifier round-trips per turn, then a policy decision, then speech synthesis. Each piece was fast; the pipeline wasn’t.
Mid-turn disagreement. When two classifiers fired at once — caller names a provider and says goodbye in the same breath — two subsystems both believed they owned the next turn. Most of the audible bugs (double prompts, talking over the caller, contradictory follow-ups) were exactly this.
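The mid-turn disagreement is easy to reproduce in miniature. In the sketch below, two handlers each believe they may speak, and a single caller utterance triggers both. The `Session` class, handler names, and prompts are hypothetical and only illustrate the failure mode described above.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    # Utterances queued for playback to the caller.
    outbox: list = field(default_factory=list)

    def say(self, text: str) -> None:
        self.outbox.append(text)


def provider_handler(session: Session, utterance: str) -> None:
    # Fires when the caller names a provider.
    if "dr." in utterance.lower():
        session.say("Which day works for you?")


def end_of_call_handler(session: Session, utterance: str) -> None:
    # Fires when the caller signals they are done.
    if "bye" in utterance.lower():
        session.say("Thanks for calling. Goodbye.")


def handle_turn(session: Session, utterance: str) -> None:
    # No single owner: every handler may initiate speech on its own.
    for handler in (provider_handler, end_of_call_handler):
        handler(session, utterance)
```

After `handle_turn(session, "Put me down for Dr. Lee, bye")`, the outbox holds a follow-up question and a goodbye for the same turn: the contradictory double prompt a caller would actually hear.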
The deep lesson from architecture 1: distributed decision-making without a single owner is the bug. It doesn’t matter how clean each component is; if N components can each initiate speech, you ship race conditions to people’s ears.
Architecture 2: server-gated turn-taking
Architecture 2: one decider, full control — and a server round-trip of dead air on every single turn.
The obvious fix for "too many deciders" is one decider. The second build made the server the single owner of every turn: the realtime speech model was muted by default and spoke only when the server explicitly triggered a response with instructions. Caller speaks → transcript → server classifies → server decides what should be said → server instructs the model to say it.
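A minimal sketch of that gated loop, assuming a speech model object that stays silent until told to respond. `MutedSpeechModel`, `classify`, `decide`, and the instruction strings are placeholders for illustration, not the API of any real realtime speech service.

```python
from dataclasses import dataclass, field


@dataclass
class MutedSpeechModel:
    """Stand-in for a realtime speech model that never speaks unprompted."""
    spoken: list = field(default_factory=list)

    def respond(self, instructions: str) -> None:
        # In a real system this would trigger audio generation.
        self.spoken.append(instructions)


def classify(transcript: str) -> str:
    # Toy classifier: the server labels the caller's turn.
    return "end_call" if "bye" in transcript.lower() else "continue"


def decide(label: str) -> str:
    # The server, not the model, decides what gets said next.
    if label == "end_call":
        return "Say a brief goodbye."
    return "Ask the next required question."


def on_caller_turn(model: MutedSpeechModel, transcript: str) -> str:
    # Every step runs in sequence before the model is allowed to speak.
    label = classify(transcript)
    instructions = decide(label)
    model.respond(instructions)
    return instructions
```

Every call to `on_caller_turn` is sequential by construction, so whatever time `classify` and `decide` take is silence on the line before the model starts talking.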
What it bought
Maximal control. The agent literally could not speak...