What TTS Throws Away

Amala David

The paralinguistic gap between human speech and synthetic audio · May 2026

When a podcast host says "spaaaace" and holds the vowel for 1.1 seconds, every ASR (automatic speech recognition) system transcribes it as "space." When two friends banter and one reacts with "Oh. Ooh. Mm." for 2.4 seconds of pure non-verbal communication, the transcript shows three punctuated words. When a speaker abandons a clause mid-thought ("I really have no — I never thought…"), the transcript records an em-dash. The audio records an affect shift.

That's the gap. The training data for every TTS system is built from transcripts that discard the expressive vocabulary humans actually use. Drawn-out vowels, reactive backchannels, mid-sentence register shifts, the timing of a laugh — all of it is flattened into clean text before a TTS model ever sees it. The model can't learn what it's never shown.
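To make the flattening concrete, here is a minimal sketch of the kind of text normalization that erases these cues. It is an illustration only: the function name `flatten` and its rules are hypothetical, and it does not describe how any particular ASR or TTS pipeline actually cleans its transcripts.

```python
import re

def flatten(text):
    """Hypothetical cleanup pass: removes the cues this post is about."""
    # Drop bracketed audio events such as [laughs].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Collapse a letter held three or more times down to a single letter.
    # (Crude on purpose: it would also damage a word whose correct
    # spelling has a double letter at the elongated position.)
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1", text)
    # Normalize the whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()

print(flatten("spaaaace"))
print(flatten("watch all the Don't Be a Menaces. [laughs] And I mean"))
```

After a pass like this, the 1.1-second vowel and the laugh leave no trace in the text, so a model trained on the result has nothing to learn them from.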

This post is about what that gap sounds like, why it exists, and what it would take to close it. Press play below — then expand the comparisons to see what three STT systems and four TTS engines make of the same 25 seconds.

The Read · 2-host pop-culture podcast · reminiscing about childhood movie restrictions

A note on the clip: This clip contains the N-word, used colloquially by the hosts. It was selected for its paralinguistic richness (drawn-out vowels, mid-sentence pivots, backchannels, laughter), not its lexical content. STT transcripts reproduce it verbatim because verbatim fidelity is the point of the comparison.

step 1 · what STT heard
Three frontier ASRs, same 25 seconds, three different transcripts

Deepgram Nova-3 · conf 0.999
Oh. If you watch nigga movies, I feel like it's one of those. But for a massive chunk of my life, I didn't because my mama didn't let me watch anything that was rated worse than PG 13. That's true. And then when I went off college, I just had no desire to go back and watch all the Don't Be a Menaces. And I mean, I really have no I'd I never thought that there was a chance that you have seen that. But the point is that many of you at
sentiment: negative (-0.45) · intents: Express frustration about nigga movies · topics: Movie watching · summary: Speaker 0 discusses how he didn't watch the Partners in the Middle, citing his lack of desire to return to watching movies that were rated PG 13.

OpenAI gpt-4o-transcribe
If you watch nigga movies, I feel like it's one of those. But for a massive chunk of my life, I didn't because my mom didn't let me watch anything that was rated worse than PG-13. And then when I went off to college, I just had no desire to go back and watch all the Don't Be a Menaces. I mean, I really have no idea. I never thought that there was a chance that you have seen it. But the point is that many of you...

ElevenLabs Scribe v2
Oh. If you watch nigga movies, I feel like it's one of those. Mm. But for a, a massive chunk of my life I didn't, 'cause my mama didn't let me watch anything that was rated worse than PG-13. That's true. And then when I went off to college, I just had no desire to go back and watch all the Don't Be a Menaces. [laughs] And I mean, I really have no— I never thought that there was a chance that you have seen that.
audio events captured: [laughs]

What this comparison shows:

(1) OpenAI drops the leading 'Oh'; ElevenLabs keeps 'Oh' and the 'Mm' backchannel that the other two missed.
(2) Deepgram's 'Express frustration' intent and 'negative' sentiment are wrong; the host is warm and reminiscent.
(3) Only ElevenLabs captures [laughs] mid-utterance.
(4) Deepgram preserves the disfluency 'I really have no I'd I never thought'; OpenAI invents 'no idea' to clean it up.

Listen for: the mid-sentence pivot at the em-dash and the [laughs] event ElevenLabs caught. None of these STT outputs carries the affect of the clip; they capture words and, in one case, events.
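Differences like these can be surfaced mechanically. The sketch below uses Python's standard `difflib` to diff two transcripts word by word, run on the abandoned-clause sentence as Deepgram and OpenAI rendered it. The helper names `tokens` and `word_diff` are made up for illustration, and this is not necessarily how the comparison above was produced.

```python
import difflib
import re

def tokens(text):
    # Lowercase word tokens; punctuation is dropped so only lexical
    # differences show up in the diff.
    return re.findall(r"[a-z0-9']+", text.lower())

def word_diff(reference, other):
    """Return (op, reference_words, other_words) for each non-matching span."""
    ref, oth = tokens(reference), tokens(other)
    matcher = difflib.SequenceMatcher(a=ref, b=oth, autojunk=False)
    return [
        (op, ref[i1:i2], oth[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

deepgram = ("And I mean, I really have no I'd I never thought that "
            "there was a chance that you have seen that.")
openai = ("I mean, I really have no idea. I never thought that "
          "there was a chance that you have seen it.")

for op, ref_words, other_words in word_diff(deepgram, openai):
    print(op, ref_words, other_words)
```

A word-level diff flags the invented 'idea', but it says nothing about affect: the held vowel, the laugh, and the pivot are invisible to it because they were never in the text.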

step 2 · what TTS does with that transcript
Same transcript fed to 4 frontier TTS labs: listen, then see where the time goes

The original is 25 seconds. The same 8-segment script rendered through each TTS lab takes 28.7s to 34.9s. The chart aligns all 9 versions on a shared time axis, colored by speaker. Clean = raw STT transcript; Enhanced = same transcript with lab-appropriate affective markup. Two views compare timing against silence placement:

Timing: how long each segment took
Amplitude envelope: where silence vs speech lands

[Timing chart: shared time axis from 0s to 30s. Each row shows the same segment labels: Oh. If you watch… Mm. But for a massive chunk… That's true. And then when I went off… I mean, I really have no —… [laughs] But the point…]

Original audio (Scribe v2): 24.68s
ElevenLabs v3, clean: 28.74s
ElevenLabs v3, enhanced: 29.48s
Gemini...
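As a rough worked example, the durations reported above can be turned into relative overhead. The sketch below uses only figures that appear in this post (24.68s for the original, 28.74s and 29.48s for the two ElevenLabs v3 renders, and 34.9s as the slowest render in the stated range). The function name `overhead_pct` is illustrative.

```python
# Durations in seconds, copied from the figures reported in this post.
ORIGINAL_S = 24.68
RENDERS_S = {
    "ElevenLabs v3, clean": 28.74,
    "ElevenLabs v3, enhanced": 29.48,
    "slowest render in the reported range": 34.9,
}

def overhead_pct(rendered_s, original_s=ORIGINAL_S):
    """Extra duration of a render relative to the original, in percent."""
    return (rendered_s - original_s) / original_s * 100

for name, seconds in RENDERS_S.items():
    extra = seconds - ORIGINAL_S
    print(f"{name}: +{extra:.2f}s ({overhead_pct(seconds):.1f}% longer)")
```

On these numbers, even the fastest render runs roughly a sixth longer than the human original, and the slowest about 40% longer.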
