AI Voice Agent Architecture: How Real-Time Conversational Systems Work


AI Voice Agent Architecture: Lessons From 3 Production Builds


Post author: Farid Fadaie

Post published: June 10, 2026

Post category: AI


I built the same production voice agent three times. Same requirements, same telephony stack, same speech models available — three fundamentally different architectures. The first one collapsed under its own coupling: every bug fix broke something else. The second one was controllable and correct, and it destroyed the one thing a voice agent exists for: responding like a human, immediately. The third one is the only one that survived contact with real callers.

This post is the architecture write-up I wish I’d read before spending months and real money finding out the hard way. It’s about one question that turns out to decide everything: who owns the next turn?

The problem shape

The agent answers real phone calls. It has to do three things at once, and they pull in different directions:

Hold a natural conversation. Sub-second turn-taking, no dead air, graceful handling of interruptions, callers who ramble, and callers who switch languages mid-sentence.

Capture structured data reliably. A defined set of fields must end up in a database — names spelled right, numbers digit-perfect, nothing invented.

Enforce hard rules. A few behaviors must always happen (reading a phone number back digit by digit) and a few must never happen (hanging up while a required question is unanswered, promising things the business didn’t authorize).
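The two hard rules named above can be pictured as small, deterministic checks. This is a minimal sketch, not the author's implementation: the field names in `REQUIRED_FIELDS`, the `CallState` class, and both helper functions are hypothetical, since the post does not list the real schema.

```python
from dataclasses import dataclass, field

# Hypothetical required fields; the post does not name the real ones.
REQUIRED_FIELDS = ("caller_name", "callback_number")


def read_back_digits(number: str) -> str:
    """Render a phone number so it is spoken one digit at a time."""
    digits = [ch for ch in number if ch.isdigit()]
    return " ".join(digits)


@dataclass
class CallState:
    captured: dict = field(default_factory=dict)

    def missing_fields(self) -> list:
        """Required fields that have no captured value yet."""
        return [name for name in REQUIRED_FIELDS if not self.captured.get(name)]

    def may_end_call(self) -> bool:
        """Rule: never hang up while a required question is unanswered."""
        return not self.missing_fields()
```

Under these assumptions, `read_back_digits("(415) 555-0134")` yields `"4 1 5 5 5 5 0 1 3 4"`, and `may_end_call()` stays false until every required field has a value.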

Every architecture below is a different answer to where those three responsibilities live.

Architecture 1: the server-side orchestrator

Architecture 1: every caller turn fans out to parallel classifiers; shared flags couple everything to everything.

The first build treated the speech model as a mouthpiece. The server owned everything: each caller utterance was transcribed and fanned out to a battery of parallel classifiers. One decided the call category, one watched for a provider name, one watched for end-of-call intent, and one detected reference-only questions. Alongside them sat hand-maintained regex banks for language detection, including phonetic transliterations of "can you speak X to me?" in half a dozen scripts. A merge layer combined their outputs into a "turn analysis," and a policy engine picked what to say next, often from scripted prompts.
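The shape of that pipeline can be sketched roughly as follows. Everything here is an illustrative assumption: the classifiers are toy keyword checks standing in for real model calls, and the names `analyze_turn` and `pick_prompt`, the provider "Dr. Lee", and the scripted prompts are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


# Toy stand-ins for the classifiers; real ones would be model calls.
def classify_category(text: str) -> str:
    return "appointment" if "appointment" in text.lower() else "unknown"


def detect_provider(text: str) -> bool:
    return "dr. lee" in text.lower()  # hypothetical provider name


def detect_end_of_call(text: str) -> bool:
    return any(word in text.lower() for word in ("bye", "goodbye"))


CLASSIFIERS = {
    "category": classify_category,
    "provider": detect_provider,
    "end_of_call": detect_end_of_call,
}


def analyze_turn(transcript: str) -> dict:
    """Fan the transcript out to every classifier, then merge the results."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, transcript) for name, fn in CLASSIFIERS.items()}
        return {name: future.result() for name, future in futures.items()}


def pick_prompt(analysis: dict) -> str:
    """Policy engine: choose a scripted prompt from the merged turn analysis."""
    if analysis["end_of_call"]:
        return "Thanks for calling. Goodbye."
    if analysis["provider"]:
        return "Which day works best to see that provider?"
    return "How can I help you today?"
```

Even in this toy version the policy has to rank the classifiers against each other, which is where the coupling described later starts.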

Why it’s tempting

Every decision is inspectable server code. When something goes wrong, there’s a line number.

Each classifier is small, cheap, and unit-testable in isolation.

Adding a rule looks easy: write another classifier, wire it into the merge.

Scripted prompts give you word-level control over what the agent says.

Why it collapsed

The classifiers were never actually independent. The merge layer and a growing web of shared session flags meant every concern’s guard read every other concern’s state. The provider gate needed to know about the end-of-call state; the end-of-call logic needed to know whether a language switch was pending; the language logic needed to know whether a scripted prompt was mid-playback. Fixing a bug in one guard reliably broke an invariant another guard depended on. This is the "too intertangled" failure: the system had no single owner for any decision, so every decision was made N times by N pieces of code that disagreed at the margins.

Regex banks don’t survive multilingual reality. The language-detection patterns grew into a museum of transliterations and edge cases — unfalsifiable, unreviewable, and wrong in ways you only discover on a live call. Every new language meant re-deriving every pattern. The model I was routing around already understood all of these languages natively.

Latency stacked. Multiple classifier round-trips per turn, then a policy decision, then speech synthesis. Each piece was fast; the pipeline wasn’t.
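A back-of-the-envelope budget shows how fast stages add up. The numbers below are invented purely for illustration and are not measurements from any of the builds; only the stage names follow the pipeline described above.

```python
# Illustrative, made-up latencies in milliseconds. NOT measurements from the post.
TRANSCRIPTION_MS = 200
CLASSIFIER_MS = {"category": 120, "provider": 150, "end_of_call": 90, "reference_only": 110}
MERGE_AND_POLICY_MS = 40
SYNTHESIS_FIRST_AUDIO_MS = 250


def turn_latency_ms(parallel_classifiers: bool = True) -> int:
    """Sum the stages; a parallel fan-out costs as much as its slowest classifier."""
    if parallel_classifiers:
        classifier_stage = max(CLASSIFIER_MS.values())
    else:
        classifier_stage = sum(CLASSIFIER_MS.values())
    return TRANSCRIPTION_MS + classifier_stage + MERGE_AND_POLICY_MS + SYNTHESIS_FIRST_AUDIO_MS
```

With these invented figures no single stage exceeds 250 ms, yet the caller waits 640 ms even with perfect parallelism and 960 ms if the classifiers run one after another.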

Mid-turn disagreement. When two classifiers fired at once — caller names a provider and says goodbye in the same breath — two subsystems both believed they owned the next turn. Most of the audible bugs (double prompts, talking over the caller, contradictory follow-ups) were exactly this.

The deep lesson from architecture 1: distributed decision-making without a single owner is the bug. It doesn’t matter how clean each component is; if N components can each initiate speech, you ship race conditions to people’s ears.
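The race is easy to reproduce in miniature. In the sketch below, two hypothetical subsystems (`provider_gate` and `end_of_call_gate`, both invented for illustration) react to the same utterance. The first function lets each one speak; the second routes proposals through a single owner with a fixed priority order. It illustrates the principle only and is not a description of the later builds.

```python
def provider_gate(transcript: str):
    """Hypothetical subsystem: reacts when a provider is named."""
    if "dr. lee" in transcript.lower():
        return "Which day works best to see Dr. Lee?"
    return None


def end_of_call_gate(transcript: str):
    """Hypothetical subsystem: reacts to a goodbye."""
    if "bye" in transcript.lower():
        return "Thanks for calling. Goodbye."
    return None


def speak_directly(transcript: str, subsystems: list) -> list:
    """Failure mode: every subsystem that fires initiates its own speech."""
    return [reply for handler in subsystems if (reply := handler(transcript)) is not None]


def speak_with_single_owner(transcript: str, subsystems: list) -> list:
    """One owner collects proposals and emits at most one utterance.

    List order acts as the priority, which is an assumed policy for this sketch.
    """
    for handler in subsystems:
        reply = handler(transcript)
        if reply is not None:
            return [reply]
    return []
```

For the utterance "Dr. Lee please, okay bye", `speak_directly` produces two competing utterances, while `speak_with_single_owner` produces exactly one.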

Architecture 2: server-gated turn-taking

Architecture 2: one decider, full control — and a server round-trip of dead air on every single turn.

The obvious fix for "too many deciders" is one decider. The second build made the server the single owner of every turn: the realtime speech model was muted by default and spoke only when the server explicitly triggered a response with instructions. Caller speaks → transcript → server classifies → server decides what should be said → server instructs the model to say it.
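That gated loop can be sketched as below. `MutedSpeechModel` is a hypothetical stand-in, not a real SDK class, and `classify` and `decide` are toy placeholders for the server-side logic. The point is only that speech happens in exactly one place, after the server has finished deciding.

```python
class MutedSpeechModel:
    """Hypothetical stand-in for a realtime speech model that is muted by default."""

    def __init__(self):
        self.spoken = []

    def hear(self, transcript: str) -> None:
        """Muted by default: hearing the caller never triggers speech."""

    def respond(self, instructions: str) -> str:
        """The only path to speech: an explicit, server-issued instruction."""
        self.spoken.append(instructions)
        return instructions


def classify(transcript: str) -> str:
    # Toy classifier; a real server would call a model here.
    return "goodbye" if "bye" in transcript.lower() else "other"


def decide(label: str) -> str:
    # Toy policy mapping a label to instructions for the speech model.
    return {"goodbye": "Say a brief goodbye.", "other": "Ask how you can help."}[label]


def handle_turn(model: MutedSpeechModel, transcript: str) -> str:
    """Caller speaks, server classifies, server decides, server instructs the model."""
    model.hear(transcript)
    label = classify(transcript)
    instructions = decide(label)
    return model.respond(instructions)
```

Every step in `handle_turn` sits between the caller finishing and the model starting to speak, so in this design the server's processing time is paid on every turn.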

What it bought

Maximal control. The agent literally could not speak...
