Solving end-of-turn detection: LiveKit Turn Detector v1.0
Date: 06.17.2026
Authors: Chenghao Mou, Leigh Weston, Jerad Fields
Reading time: 8 min read
Tags: Product
Every voice agent has to answer the same question, over and over, on every pause: is the user done talking? Answer too early and the agent talks over people. Answer too late and the conversation fills with dead air. End-of-turn detection is the difference between a voice agent that feels like a conversation and one that feels like a walkie-talkie, and it has been one of the hardest open problems in voice AI since the first agents shipped.
Today we're releasing two models that listen to the user's speech directly instead of waiting on a transcript, fusing semantic and acoustic understanding into a single end-of-turn prediction. LiveKit Turn Detector v1 posts the strongest results of any model we evaluated, across English and 13 other languages. It runs on optimized inference in LiveKit Cloud, at no charge for agents running there, and is now the default. v1-mini is an open-weight model with the same architecture, optimized for fast CPU inference.
Our goal with v1 was to make turn detection so good that you never have to think about it. With this release, we consider end-of-turn detection a solved problem for agents built on LiveKit.
The long road to this release
We've been working on this problem for a while. In 2024 we shipped an open-source transformer model for turn detection that used the semantic content of the transcript to predict whether a user had finished a thought. Later versions cut unwanted interruptions by 39% and extended coverage to more languages. Each generation got better at understanding what users were saying.
But text-based models, however good, share a ceiling. To break through it, we had to stop reading and start listening.
Why text alone falls short
Text-based end-of-turn models are highly effective at capturing user intent and semantic meaning. But relying on text alone imposes three structural limits.
First, the model is only as good as the transcript it's fed: errors, latency, or inconsistencies in speech-to-text directly degrade predictions. Second, transcription itself adds delay, since inference can't begin until the final transcript arrives, and that latency lands directly on the agent's response time. Third, and most fundamentally, reducing speech to text throws away the timing and acoustic signals that tell you when a speaker has actually finished.
Consider two cases where a user pauses after "pizza":
Agent: What would you like to order?
User: I would like to order one large pizza…

Agent: What would you like to order?
User: I would like to order one large pizza… and a garlic bread
At the moment of the pause, the transcript is identical in both cases. No amount of semantic modeling can distinguish them, because the distinction isn't in the words; it's in how they're delivered. Humans resolve this effortlessly using paralinguistic cues: intonation, pitch, rhythm. An upward inflection signals an unfinished thought; a drop in pitch often signals completion. When speech is reduced to text, those signals are gone.
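To make the ambiguity concrete, here is a toy sketch in Python. It is not LiveKit's model: `text_only_score` and `with_prosody_score` are hypothetical stand-ins, and the pitch-slope values and the 1.5 weight are invented for illustration. It shows only that any function of the transcript alone must return the same answer in both cases, while a function that also sees a prosodic feature can tell them apart.

```python
import math

TRANSCRIPT = "I would like to order one large pizza"


def text_only_score(transcript: str) -> float:
    """Hypothetical text-only scorer: a deterministic function of the words."""
    # Crude stand-in for semantic modeling: a trailing "pizza" looks fairly complete.
    return 0.6 if transcript.rstrip(".… ").endswith("pizza") else 0.3


def with_prosody_score(transcript: str, pitch_slope: float) -> float:
    """Hypothetical scorer that also sees the final pitch slope.

    A falling pitch (negative slope) raises the end-of-turn score,
    a rising pitch (positive slope) lowers it. The weight is made up.
    """
    p_text = text_only_score(transcript)
    logit = math.log(p_text / (1 - p_text)) - 1.5 * pitch_slope
    return 1 / (1 + math.exp(-logit))


# Same words at the pause, different delivery (invented slope values).
finished = {"transcript": TRANSCRIPT, "pitch_slope": -2.0}
continuing = {"transcript": TRANSCRIPT, "pitch_slope": 2.0}

for name, utterance in (("finished", finished), ("continuing", continuing)):
    print(
        name,
        round(text_only_score(utterance["transcript"]), 3),
        round(with_prosody_score(**utterance), 3),
    )
```

The text-only score is identical for both utterances by construction; only the scorer that receives the pitch slope places them on opposite sides of 0.5.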
Listening instead of reading
LiveKit Turn Detector v1 keeps the LLM backbone our previous models used for semantic reasoning, and adds audio encoders that process the user's speech directly. The model captures both what is being said and how it's being said, with no transcription step in between.
The semantic branch uses an audio encoder, a learned adapter, and a fine-tuned language model. The adapter projects the user's audio into the embedding space of the LLM, so the model gets the same kind of semantic signal a text-based version would, without ever going through a transcript. The acoustic branch runs a separate encoder into a recurrent layer that captures timing and prosody. A fusion module combines both encodings into a single end-of-turn prediction.
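The data flow described above can be sketched in a few lines of dependency-free Python. This is a structural illustration only, under heavy assumptions: the weights are random and untrained, every dimension is made up, mean-pooling stands in for the fine-tuned language model, a plain tanh recurrence stands in for the recurrent layer, and the fusion module is assumed to be a single linear layer followed by a sigmoid. None of these details come from the release.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy weights are reproducible


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


FRAME_DIM, ENC_DIM, LLM_DIM, RNN_DIM = 4, 6, 8, 5  # all sizes are hypothetical

W_SEM_ENC = rand_matrix(ENC_DIM, FRAME_DIM)  # semantic-branch audio encoder
W_ADAPTER = rand_matrix(LLM_DIM, ENC_DIM)    # adapter: encoder space -> LLM embedding space
W_AC_ENC = rand_matrix(ENC_DIM, FRAME_DIM)   # acoustic-branch encoder
W_RNN_IN = rand_matrix(RNN_DIM, ENC_DIM)     # recurrent layer, input weights
W_RNN_H = rand_matrix(RNN_DIM, RNN_DIM)      # recurrent layer, hidden weights
W_FUSE = rand_matrix(1, LLM_DIM + RNN_DIM)[0]  # fusion: linear layer over both encodings


def semantic_branch(frames):
    """Audio encoder -> adapter -> (stand-in for) language model."""
    embeddings = [
        matvec(W_ADAPTER, [math.tanh(v) for v in matvec(W_SEM_ENC, frame)])
        for frame in frames
    ]
    # Placeholder for the language model: mean-pool the projected embeddings.
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


def acoustic_branch(frames):
    """Separate encoder feeding a simple recurrent layer; returns the last hidden state."""
    hidden = [0.0] * RNN_DIM
    for frame in frames:
        encoded = [math.tanh(v) for v in matvec(W_AC_ENC, frame)]
        pre = [a + b for a, b in zip(matvec(W_RNN_IN, encoded), matvec(W_RNN_H, hidden))]
        hidden = [math.tanh(p) for p in pre]
    return hidden


def end_of_turn_probability(frames):
    """Fuse both encodings into a single end-of-turn probability."""
    fused = semantic_branch(frames) + acoustic_branch(frames)  # list concatenation
    logit = sum(w * x for w, x in zip(W_FUSE, fused))
    return 1 / (1 + math.exp(-logit))


# Three fake audio feature frames for the current user turn (invented values).
frames = [[0.1, 0.2, 0.0, -0.1], [0.3, -0.2, 0.1, 0.0], [0.0, 0.1, -0.3, 0.2]]
probability = end_of_turn_probability(frames)
print(probability)
```

Note that both branches consume the same audio frames and no transcript appears anywhere in the path. The sketch expects at least one frame; an empty turn would need separate handling.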
This design has two practical consequences. The latency cost of waiting for a transcript is gone: inference proceeds directly from the audio stream. And because the acoustic branch carries strong information about the current user turn, the model no longer needs prior turns of chat context the way text-only versions did. It looks at the current user turn only, which keeps the context window short and inference fast.
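The first consequence is simple arithmetic, sketched below. The function names are hypothetical, the numbers are placeholders rather than measurements of any real system, and the sketch assumes for simplicity that both detectors take the same time to run inference.

```python
def text_pipeline_latency_ms(stt_final_delay_ms: float, inference_ms: float) -> float:
    """Text-based detector: inference cannot start until the final transcript arrives."""
    return stt_final_delay_ms + inference_ms


def audio_pipeline_latency_ms(inference_ms: float) -> float:
    """Audio-native detector: inference runs on the audio stream, with no transcript wait."""
    return inference_ms


# Placeholder values only, chosen to make the subtraction easy to follow.
STT_FINAL_DELAY_MS = 300.0
INFERENCE_MS = 50.0

text_total = text_pipeline_latency_ms(STT_FINAL_DELAY_MS, INFERENCE_MS)
audio_total = audio_pipeline_latency_ms(INFERENCE_MS)
saved = text_total - audio_total
print(text_total, audio_total, saved)
```

Whatever the real figures are, the saving under this assumption equals the time spent waiting for the final transcript, since that term drops out of the sum entirely.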
An open benchmark for end-of-turn detection
A claim like "state of the art" raises an immediate question: state of the art on what? There are no established public benchmarks for end-of-turn detection. Every provider evaluates on private data, with methodologies that rarely match how models behave in production. That makes results impossible to compare and impossible to reproduce.
So alongside v1, we're releasing...