What TTS throws away
Amala David
The paralinguistic gap between human speech and synthetic audio · May 2026
When a podcast host says "spaaaace" and holds the vowel for 1.1 seconds, every ASR (automatic speech recognition) transcribes it as "space." When two friends banter and one reacts with "Oh. Ooh. Mm." for 2.4 seconds of pure non-verbal communication, the transcript shows three punctuated words. When a speaker abandons a clause mid-thought — "I really have no — I never thought…" — the transcript records an em-dash. The audio records an affect shift.
That's the gap. The training data for every TTS system is built from transcripts that discard the expressive vocabulary humans actually use. Drawn-out vowels, reactive backchannels, mid-sentence register shifts, the timing of a laugh — all of it is flattened into clean text before a TTS model ever sees it. The model can't learn what it's never shown.
This post is about what that gap sounds like, why it exists, and what it would take to close it. Press play below — then expand the comparisons to see what three STT systems and four TTS engines make of the same 25 seconds.
The Read · 2-host pop-culture podcast · reminiscing about childhood movie restrictions<br>A note on the clip: This clip contains the N-word, used colloquially by the hosts. It was selected for its paralinguistic richness — drawn-out vowels, mid-sentence pivots, backchannels, laughter — not its lexical content. STT transcripts reproduce it verbatim because verbatim fidelity is the point of the comparison.
step 1 · what STT heard<br>Three frontier ASRs, same 25 seconds, three different transcripts<br>Deepgram Nova-3 · conf 0.999<br>Oh. If you watch nigga movies, I feel like it's one of those. But for a massive chunk of my life, I didn't because my mama didn't let me watch anything that was rated worse than PG 13. That's true. And then when I went off college, I just had no desire to go back and watch all the Don't Be a Menaces. And I mean, I really have no I'd I never thought that there was a chance that you have seen that. But the point is that many of you at<br>sentiment: negative (-0.45) · intents: Express frustration about nigga movies · topics: Movie watching · summary: Speaker 0 discusses how he didn't watch the Partners in the Middle, citing his lack of desire to return to watching movies that were rated PG 13.
OpenAI gpt-4o-transcribe<br>If you watch nigga movies, I feel like it's one of those. But for a massive chunk of my life, I didn't because my mom didn't let me watch anything that was rated worse than PG-13. And then when I went off to college, I just had no desire to go back and watch all the Don't Be a Menaces. I mean, I really have no idea. I never thought that there was a chance that you have seen it. But the point is that many of you...
ElevenLabs Scribe v2<br>Oh. If you watch nigga movies, I feel like it's one of those. Mm. But for a, a massive chunk of my life I didn't, 'cause my mama didn't let me watch anything that was rated worse than PG-13. That's true. And then when I went off to college, I just had no desire to go back and watch all the Don't Be a Menaces. [laughs] And I mean, I really have no— I never thought that there was a chance that you have seen that.<br>audio events captured: [laughs]
What this comparison shows: (1) OpenAI drops the leading 'Oh'; ElevenLabs keeps 'Oh' and is the only one to keep the 'Mm' backchannel. (2) Deepgram's 'Express frustration' intent and 'negative' sentiment label are both wrong: the host is warm and reminiscent. (3) Only ElevenLabs captures [laughs] mid-utterance. (4) Deepgram preserves the disfluency 'I really have no I'd I never thought'; OpenAI invents 'no idea' to clean it up.<br>Listen for: The mid-sentence pivot at the em-dash and the [laughs] event ElevenLabs caught. None of these STT outputs carries the affect of the clip; they capture words and, in one case, events.
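Differences like the dropped laugh and the invented 'idea' can be surfaced mechanically with a token-level diff. The sketch below is an illustration, not the method used to produce this comparison: it relies on Python's standard difflib, the helper names tokenize and token_diff are made up here, and the two inputs are the closing sentence of the Scribe v2 and gpt-4o-transcribe transcripts quoted above.

```python
import difflib
import re


def tokenize(text):
    # Keep bracketed audio events such as [laughs] as single tokens; lowercase
    # ordinary words and ignore punctuation so only lexical changes show up.
    tokens = re.findall(r"\[[^\]]+\]|[A-Za-z']+", text)
    return [t if t.startswith("[") else t.lower() for t in tokens]


def token_diff(reference, candidate):
    """Return (dropped, added): tokens only in reference, tokens only in candidate."""
    ref, cand = tokenize(reference), tokenize(candidate)
    matcher = difflib.SequenceMatcher(a=ref, b=cand, autojunk=False)
    dropped, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            dropped.extend(ref[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(cand[j1:j2])
    return dropped, added


# Excerpts copied from the transcripts in this post.
scribe = ("[laughs] And I mean, I really have no— I never thought that "
          "there was a chance that you have seen that.")
gpt4o = ("I mean, I really have no idea. I never thought that "
         "there was a chance that you have seen it.")

dropped, added = token_diff(scribe, gpt4o)
print("only in Scribe v2:", dropped)
print("only in gpt-4o-transcribe:", added)
```

On this excerpt the diff reports [laughs], 'and', and 'that' as present only in the Scribe v2 output, and 'idea' and 'it' as present only in the OpenAI output. A diff like this counts tokens and events; it says nothing about timing or affect, which is the part the transcripts never held in the first place.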
step 2 · what TTS does with that transcript<br>Same transcript fed to 4 frontier TTS labs: listen, then see where the time goes<br>The original is 25 seconds. The same 8-segment script rendered through each TTS lab takes 28.7s to 34.9s. The chart aligns all 9 versions on a shared time axis, colored by speaker. Clean = raw STT transcript; Enhanced = same transcript with lab-appropriate affective markup. Switch tabs to compare timing vs silence placement.<br>Tabs: Timing (how long each segment took) · Amplitude envelope (where silence vs speech lands)
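To put the durations on a common scale, each render can be expressed as a stretch relative to the original clip. This is a minimal sketch, assuming only the durations the chart reports for the original (24.68s) and the two ElevenLabs v3 renders; the function name stretch_percent and the dictionary labels are illustrative, and the other labs' exact durations are left out because only their overall range is stated above.

```python
ORIGINAL_S = 24.68  # duration of the original clip, as reported in the chart

# Render durations reported in the chart, in seconds.
renders = {
    "ElevenLabs v3 clean": 28.74,
    "ElevenLabs v3 enhanced": 29.48,
}


def stretch_percent(render_s, original_s=ORIGINAL_S):
    """Percent by which a render runs longer than the original clip."""
    return (render_s / original_s - 1.0) * 100.0


for name, seconds in renders.items():
    extra = seconds - ORIGINAL_S
    print(f"{name}: {extra:+.2f}s ({stretch_percent(seconds):+.1f}%)")
```

By this measure the two ElevenLabs renders run roughly 16% and 19% longer than the source, and the slowest render in the stated range (34.9s) runs roughly 41% longer.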
Chart: segment timing on a shared time axis (ticks every 5s from 0s to 30s). The rows below carry the same 8 segment labels: Oh. · If you watch… · Mm. But for a massive chunk… · That's true. · And then when I went off… · I mean, I really have no —… · [laughs] · But the point…<br>Original audio (Scribe v2) · 24.68s<br>ElevenLabs v3 — clean · 28.74s<br>ElevenLabs v3 — enhanced · 29.48s
Gemini...