Why WebRTC beats WebSockets for realtime voice AI




Date: 03.23.2026

Author: Chris Wilson


When developers start building voice AI agents, the first architectural decision is transport: how does audio get between the user and the agent? Many reach for WebSockets because they're familiar, well-documented, and already part of most web stacks. It seems like a reasonable choice — open a socket, stream audio bytes in both directions, done.

It works in a demo. It falls apart in production.

The gap between "audio is flowing" and "this feels like a real conversation" is enormous, and it's almost entirely a transport problem. WebSockets weren't designed for realtime media. WebRTC was. That distinction matters far more than most developers expect when they start building.

What WebSockets actually give you

WebSockets provide a persistent, full-duplex TCP connection between a client and server. They're great for chat, notifications, and streaming structured data. For those use cases, they're the right tool.

But when you push raw audio over a WebSocket, you inherit every property of TCP — including the ones that actively work against realtime conversation.

TCP guarantees ordered, reliable delivery. Every packet arrives, and it arrives in sequence. If a packet is lost in transit, the receiving TCP stack holds back everything that arrived after it until the retransmission fills the gap. This is called head-of-line blocking, and for audio, it's devastating.

Consider what happens when a single packet is lost during a conversation. With TCP, the receiver stalls — possibly for hundreds of milliseconds — waiting for the retransmission. The audio that arrived perfectly fine after the lost packet sits in a buffer, unplayed, until the gap is filled. The user hears silence, then a burst of buffered audio. The conversational rhythm breaks.
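The stall described above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the 50 ms one-way delay, the 200 ms retransmission penalty, and the helper name `in_order_delivery_times` are assumptions chosen for the example, not measurements.

```python
def in_order_delivery_times(arrival_ms):
    """Time at which each frame becomes readable by the application when the
    transport only delivers data strictly in sequence (TCP-like behaviour)."""
    latest = 0
    delivered = []
    for t in arrival_ms:
        # A frame cannot be handed over before every earlier frame has arrived.
        latest = max(latest, t)
        delivered.append(latest)
    return delivered


FRAME_MS = 20                # one audio frame every 20 ms
ONE_WAY_MS = 50              # assumed one-way network delay
RETRANSMIT_PENALTY_MS = 200  # assumed extra delay for the retransmitted frame

# Ten frames sent back to back; frame 3 is lost once and arrives late.
arrivals = [ONE_WAY_MS + i * FRAME_MS for i in range(10)]
arrivals[3] += RETRANSMIT_PENALTY_MS

tcp_like = in_order_delivery_times(arrivals)
extra_wait = [d - a for d, a in zip(tcp_like, arrivals)]

for i, (a, d) in enumerate(zip(arrivals, tcp_like)):
    print(f"frame {i}: arrived {a} ms, readable {d} ms, held {d - a} ms")
```

Frames 4 through 9 arrive on time but sit in the buffer until frame 3 shows up, which is the silence-then-burst pattern the listener hears. A transport without ordering guarantees would hand each of those frames over at its own arrival time.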

In a text chat, a 200ms delay is invisible. In a voice conversation, it's the difference between a natural exchange and an awkward one.

WebSockets have no concept of media timing. Audio frames need to arrive at precise intervals for smooth playback. WebSockets deliver bytes — there's no jitter buffer, no playout timing, no mechanism to handle frames that arrive too early or too late. You have to build all of that yourself, and building it well is a multi-year engineering effort.
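To make "playout timing" concrete, here is a minimal sketch of the simplest scheme you would have to build yourself on top of a WebSocket: a fixed playout delay with per-frame deadlines. The function name `playout_schedule`, the 40 ms buffer, and the sample arrival times are all hypothetical; production jitter buffers adapt the delay and conceal missing frames, which this sketch does not attempt.

```python
def playout_schedule(arrival_ms, frame_ms=20, buffer_ms=40):
    """Split frames into those that meet their playout deadline and those
    that arrive too late. arrival_ms[i] is the arrival time of frame i."""
    # Playback starts a fixed delay after the first frame arrives.
    start = arrival_ms[0] + buffer_ms
    played, late = [], []
    for i, t in enumerate(arrival_ms):
        deadline = start + i * frame_ms  # frame i must be here by this time
        if t <= deadline:
            played.append(i)
        else:
            late.append(i)
    return played, late


# Frame 3 is delayed by the network and misses its slot.
sample_arrivals = [50, 72, 95, 180, 131, 150]
played, late = playout_schedule(sample_arrivals)
print("played:", played)
print("late:  ", late)
```

Even this toy version forces a design decision: a larger `buffer_ms` rescues more late frames but adds that much latency to every frame of the conversation.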

There's no built-in congestion control for media. TCP's congestion control algorithm is designed for bulk data transfer: it fills the pipe, detects loss, and backs off. This sawtooth pattern is fine for downloading files but terrible for realtime audio, where you need a steady, predictable bitrate. When the network degrades, TCP's response is to buffer more data and retry harder — exactly the wrong strategy for a live conversation where a dropped frame is better than a late one.
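The sawtooth is easy to see in a toy model of additive-increase, multiplicative-decrease (AIMD). This is a deliberately simplified sketch, not a model of any real TCP stack: the window is counted in whole segments, `capacity` stands in for the point at which the path starts dropping packets, and the function name `aimd_window` is made up for the example.

```python
def aimd_window(capacity, rounds, start=1):
    """Congestion window (in segments) at the start of each round trip
    under a simplified AIMD rule."""
    cwnd = start
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        if cwnd > capacity:
            # Loss detected: multiplicative decrease.
            cwnd = max(1, cwnd // 2)
        else:
            # No loss: additive increase of one segment per round trip.
            cwnd += 1
    return history


window_history = aimd_window(capacity=8, rounds=20)
print(window_history)
```

The window climbs until it overshoots, halves, and climbs again. An audio stream wants the opposite: the same small amount of data every 20 ms, with no periodic collapse.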

TCP windowing works against you. TCP uses a sliding window to control how much unacknowledged data can be in flight. When packets are lost, the window shrinks, throttling throughput right when you need consistent delivery. After the loss clears, the window doesn't snap back — it grows conservatively through slow start and congestion avoidance, taking multiple round trips to recover. On high-latency paths (like cross-region connections), this ramp-up is especially painful because each round trip takes longer. The result is bursts of underdelivery followed by slow recovery — exactly the kind of inconsistent throughput that turns a smooth voice conversation into a stuttering one.
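A rough back-of-the-envelope calculation shows why the ramp-up hurts more on long paths. The sketch assumes classic Reno-style behaviour (the window halves on loss, then grows by one segment per round trip with no further loss); modern stacks use other algorithms such as CUBIC, so treat the numbers as an illustration of the trend rather than a prediction. The function name and the example window and RTT values are hypothetical.

```python
def recovery_time_ms(window_segments, rtt_ms):
    """Approximate time for a Reno-style window to grow back to its
    pre-loss size after being halved, assuming no further loss."""
    halved = window_segments // 2
    round_trips = window_segments - halved  # one segment regained per RTT
    return round_trips * rtt_ms


same_region = recovery_time_ms(window_segments=20, rtt_ms=20)
cross_region = recovery_time_ms(window_segments=20, rtt_ms=150)
print(f"20 ms RTT:  {same_region} ms to recover")
print(f"150 ms RTT: {cross_region} ms to recover")
```

The number of round trips is the same in both cases; only the length of each round trip changes, so recovery time scales directly with RTT.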

What WebRTC was built to do

WebRTC was purpose-built for the problem of moving media between people in realtime. It addresses every shortcoming above with design decisions that specifically optimize for conversation.

UDP-based transport with loss tolerance. WebRTC sends media over UDP using RTP (Real-time Transport Protocol). When a packet is lost, the stream keeps flowing. A missing 20ms audio frame is nearly imperceptible to a listener; a 200ms stall while TCP retransmits is not. WebRTC trades perfect reliability for consistent timing, which is exactly the right trade-off for voice.
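For readers who have not looked inside an RTP packet, the sketch below packs and parses the fixed 12-byte RTP header and shows how a receiver can notice a gap from sequence numbers alone, without waiting for anything. It is a simplified illustration: it ignores header extensions, CSRC lists, and SRTP encryption, it assumes packets arrive in order, and payload type 111 is only an assumed dynamic value for the example. The helper names are hypothetical.

```python
import struct

# Fixed RTP header: V/P/X/CC byte, M/PT byte, sequence number,
# timestamp, SSRC. All fields are big-endian (network order).
RTP_HEADER = struct.Struct("!BBHII")


def pack_rtp(seq, timestamp, ssrc, payload, payload_type=111):
    first = 2 << 6                 # version 2, no padding/extension/CSRC
    second = payload_type & 0x7F   # marker bit left clear
    header = RTP_HEADER.pack(first, second, seq & 0xFFFF,
                             timestamp & 0xFFFFFFFF, ssrc)
    return header + payload


def parse_rtp(packet):
    first, second, seq, timestamp, ssrc = RTP_HEADER.unpack_from(packet)
    return {
        "version": first >> 6,
        "payload_type": second & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[RTP_HEADER.size:],
    }


def missing_sequence_numbers(received):
    """Sequence numbers skipped between consecutive received packets,
    handling 16-bit wraparound. Assumes in-order arrival."""
    missing = []
    for prev, cur in zip(received, received[1:]):
        gap = (cur - prev) & 0xFFFF
        for k in range(1, gap):
            missing.append((prev + k) & 0xFFFF)
    return missing


# A 20 ms frame at a 48 kHz RTP clock advances the timestamp by 960.
packet = pack_rtp(seq=7, timestamp=960, ssrc=0x1234, payload=b"abc")
print(parse_rtp(packet))
print(missing_sequence_numbers([100, 101, 103]))
```

The receiver learns that a frame is missing as soon as the next one arrives, and can conceal the gap and keep playing instead of stalling.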

Built-in jitter buffers. Network jitter — variation in packet arrival times — is unavoidable on the internet. WebRTC clients include adaptive jitter buffers that absorb this variation, smoothing out playback so the listener hears a continuous stream even when packets arrive unevenly. With WebSockets, you're on your own.
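One standard way to quantify jitter is the running interarrival jitter estimate defined in RFC 3550 (section 6.4.1), which RTP receivers report back to senders. The sketch below implements that formula on millisecond timestamps. It is not the algorithm any particular WebRTC jitter buffer uses to size itself; those are considerably more elaborate. The function name and sample values are assumptions for the example.

```python
def interarrival_jitter(send_ms, recv_ms):
    """Running jitter estimate from RFC 3550 section 6.4.1.
    send_ms[i] and recv_ms[i] are the send and receive times of packet i."""
    jitter = 0.0
    for i in range(1, len(send_ms)):
        # Difference between the spacing at the receiver and at the sender.
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        # Exponential smoothing with gain 1/16, as specified in the RFC.
        jitter += (abs(d) - jitter) / 16.0
    return jitter


steady = interarrival_jitter([0, 20, 40, 60], [50, 70, 90, 110])
bumpy = interarrival_jitter([0, 20, 40], [50, 70, 106])
print(steady, bumpy)
```

An adaptive jitter buffer uses measurements in this spirit to decide how much audio to hold before playout: more when arrival times are erratic, less when the network is calm.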

Media-aware congestion control. WebRTC implements congestion control algorithms (like Google Congestion Control, GCC) that are specifically designed for realtime media. Instead of TCP's aggressive fill-and-backoff pattern, GCC measures one-way delay variation to detect congestion before packet loss occurs. When bandwidth drops,...
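The delay-based idea can be sketched in a few lines: if packets take progressively longer to arrive than they took to send, a queue is building somewhere on the path, even though nothing has been dropped yet. The code below is a loose illustration of that principle only. Real GCC implementations use more sophisticated filtering and an adaptive threshold; the exponential smoothing, the fixed 3 ms threshold, and the function name here are all assumptions made for the example.

```python
def detect_overuse(send_ms, recv_ms, threshold_ms=3.0, alpha=0.3):
    """Return the index of the first packet at which the smoothed one-way
    delay gradient exceeds threshold_ms, or None if it never does."""
    smoothed = 0.0
    for i in range(1, len(send_ms)):
        # Positive gradient: this packet spent longer in the network
        # than the previous one did.
        gradient = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        smoothed = (1 - alpha) * smoothed + alpha * gradient
        if smoothed > threshold_ms:
            return i
    return None


send_times = [0, 20, 40, 60, 80, 100]
stable_path = [50, 70, 90, 110, 130, 150]     # constant 50 ms delay
queue_building = [50, 74, 98, 122, 146, 170]  # delay grows 4 ms per packet

print(detect_overuse(send_times, stable_path))
print(detect_overuse(send_times, queue_building))
```

In the second trace every packet is delivered, so a loss-based controller would see nothing wrong, yet the growing delay already signals that the sender should reduce its bitrate.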
