Why removing 'um' from a recording is harder than it sounds


erm: A Local CLI That Strips Ums, Uhs, and Erms From Speech | doug.sh


May 2, 2026


Linguists have a word for the ums, uhs, ers, and elongated versions (ummmm, uhhhhh) that pad spoken English: disfluencies.

I don’t record a lot of voice audio, but a few friends do, and they tell me editing those out by hand is miserable. So I built erm to do it.

uvx erm input.wav

That’s the whole interface for the common case. It writes a cleaned .wav and a JSON cut list next to the input. This post walks through how it works, because the obvious approach doesn’t sound very good and most of the code is the stuff that fixes that.

The naive version doesn’t work

You’d expect the job to be: transcribe with word-level timestamps, find tokens like um and uh, cut those ranges with ffmpeg.

That gets you maybe 60% of the way, and the result sounds worse than the original. Three reasons:

Whisper quietly leaves a lot of fillers out of the transcript, so there’s no um token to match in the first place.

Slicing audio at an arbitrary point in time produces a tiny step in the waveform. Your ear hears it as a click.

Even when the splice itself is clean, the background hiss before and after the cut doesn’t quite match, so you hear a faint shift at every edit.

Most of erm is the work of fixing those three things.
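To make the naive approach concrete, here is a minimal sketch of it. It assumes word-level timestamps are already available as (text, start, end) records; the Word type, the FILLERS set, and naive_keep_ranges are illustrative names, not erm’s actual code.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


# Illustrative filler list (an assumption, not erm's real one).
FILLERS = {"um", "uh", "er", "erm"}


def naive_keep_ranges(words, total):
    """Return the (start, end) ranges to keep after dropping filler tokens.

    `total` is the duration of the recording in seconds.
    """
    keep = []
    cursor = 0.0  # start of the audio we have not yet decided about
    for w in words:
        token = w.text.strip().lower().strip(".,!?")
        if token in FILLERS:
            if w.start > cursor:
                keep.append((cursor, w.start))
            cursor = max(cursor, w.end)
    if cursor < total:
        keep.append((cursor, total))
    return keep
```

The returned ranges are what you would hand to an audio tool to concatenate. Nothing here addresses missing tokens, clicks, or mismatched background noise, which is why this version falls short.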

A quick word on Whisper

Whisper is OpenAI’s open-source speech-to-text model. You hand it audio, it hands you back a transcript, and with the right flag it’ll also tell you the start and end timestamp of every word. It runs locally, which is what makes a tool like this possible without sending your recordings anywhere.

erm uses faster-whisper, a reimplementation that’s several times faster than the reference one and uses less memory. Same model weights, same output, just a better runtime. The default is the medium.en model, which is a good speed/accuracy balance. You can override with --model if you want small.en (faster), but I’d actually reach for large-v3. It’s noticeably better at picking up fillers and worth the extra compute.

Detection

First, run Whisper. erm asks for word-level timestamps and gives it a small instruction up front telling it not to clean up the transcript. Whisper, left alone, will edit out fillers because most of its training transcripts are clean prose. Any word that comes back as a known filler (um, uh, er, etc.) is flagged for cutting. Elongated versions like ummmm get matched against the um stem on the fly.
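One way to match elongated fillers against a stem is to collapse repeated letters before the lookup. This is a sketch under that assumption; the post does not say how erm does the matching, and FILLER_STEMS and filler_stem are hypothetical names.

```python
import re

# Illustrative stems (an assumption, not erm's real list).
FILLER_STEMS = {"um", "uh", "er", "erm", "ah"}


def filler_stem(token):
    """Return the filler stem a token reduces to, or None if it is not a filler."""
    # Drop punctuation and case so "Ummmm," and "ummmm" compare equal.
    word = re.sub(r"[^a-z]", "", token.lower())
    # Collapse any run of a repeated letter to one: "ummmm" -> "um".
    collapsed = re.sub(r"(.)\1+", r"\1", word)
    return collapsed if collapsed in FILLER_STEMS else None
```

Collapsing letters is blunt: a real word such as "err" also reduces to a stem, so a production version would likely need an exception list or a check on the word’s duration.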

Whisper still misses things, so three more passes look at the audio directly:

Gap fillers. If there’s an unusually long pause between two transcribed words (more than 350ms by default), erm checks whether somebody is actually making a sound during that “pause.” If a chunk of voice is sitting inside what Whisper marked as silence, that’s a filler Whisper deleted entirely. It really does just drop them. No token at all, just a hole in the transcript where an um used to be.

Fillers hiding inside a word. Whisper sometimes glues a filler onto an adjacent word, so "in, uhhhhh" comes back as a single in token. erm looks at long single-token words, splits them at brief dips in the audio, figures out which chunk is the actual word (based on how long that word should reasonably take to say), and treats the rest as filler.

Words that are much too long. If a word lasts way longer than its text could plausibly take to pronounce, the tail end is suspicious. erm scans the tail for voiced sound, and optionally double-checks with a pitch test: does the suspicious chunk sound like someone holding a vowel (uhhhhh), or like someone just speaking slowly? A held vowel has a steady, simple acoustic shape; real speech is constantly changing as you move between sounds. The pitch test keeps the tool from trimming slow talkers.
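The gap-filler pass described above can be sketched with a simple energy check. This is an assumption-heavy illustration: the 350ms gap comes from the post, but the frame size, the RMS threshold, the minimum voiced duration, and the function names are all made up for the example, and erm’s real voice detection may work differently.

```python
import math


def rms(frame):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0


def voiced_runs(samples, sample_rate, start, end, frame_s=0.02, threshold=0.02):
    """Return (start, end) spans in seconds inside [start, end) whose frame RMS exceeds threshold."""
    frame_len = max(1, round(frame_s * sample_rate))
    lo, hi = round(start * sample_rate), round(end * sample_rate)
    runs, run_start = [], None
    last = lo
    for i in range(lo, hi - frame_len + 1, frame_len):
        last = i
        loud = rms(samples[i:i + frame_len]) > threshold
        if loud and run_start is None:
            run_start = i
        elif not loud and run_start is not None:
            runs.append((run_start / sample_rate, i / sample_rate))
            run_start = None
    if run_start is not None:  # run reached the end of the gap
        runs.append((run_start / sample_rate, (last + frame_len) / sample_rate))
    return runs


def gap_fillers(samples, sample_rate, words, min_gap=0.35, min_voiced=0.1):
    """Flag voiced spans hiding in long gaps between (start, end) word timings."""
    cuts = []
    for (_, prev_end), (next_start, _) in zip(words, words[1:]):
        if next_start - prev_end > min_gap:
            for s, e in voiced_runs(samples, sample_rate, prev_end, next_start):
                if e - s >= min_voiced:
                    cuts.append((s, e))
    return cuts
```

A plain energy threshold will also fire on coughs or a door slamming, so treat the result as a candidate list rather than a verdict.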

All four passes (the Whisper one and the three audio ones) produce candidate cuts independently, and the lists get merged before the next step.
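The post does not spell out how the lists are merged. A reasonable reading is a union of overlapping time spans, sketched here with a hypothetical merge_cuts helper and an assumed join_gap parameter for fusing near-adjacent cuts.

```python
def merge_cuts(*passes, join_gap=0.0):
    """Union the (start, end) cut lists from several passes into sorted, non-overlapping spans.

    Spans that overlap, touch, or sit within `join_gap` seconds of each other are fused.
    """
    spans = sorted(span for p in passes for span in p)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1] + join_gap:
            # Extend the previous span instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Merging before refinement means each final cut has exactly two endpoints to adjust, even when two passes flagged the same um.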

Refining the cut points

A cut at exactly t = 1.234s lands wherever the waveform happens to be at that instant, almost never at zero. Stitching two arbitrary points together leaves a step in the waveform, and that step is the click you hear.

Two small fixes, in order. First, each cut endpoint is allowed to slide a tiny bit (up to 60ms) to land in the quietest spot nearby. If there’s a momentary lull in the audio just before or after the original cut point, slide there. The slide is bounded so it can’t cross into a neighboring word, otherwise you’d chew off real speech. Second, from that quiet spot, the endpoint snaps to the nearest moment when the waveform is exactly crossing zero. Two zero points stitched together produce a continuous waveform with no step, and no click.
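The two steps above can be sketched as follows. The 60ms slide limit is from the post; the 5ms averaging window, the function names, and the choice to treat a sign change between adjacent samples as a "zero crossing" are assumptions for illustration, not erm’s implementation.

```python
def quietest_index(samples, center, lo, hi, window):
    """Index in [lo, hi] whose neighborhood has the lowest mean absolute amplitude.

    Ties go to the index closest to `center`.
    """
    best, best_key = center, (float("inf"), 0)
    for i in range(lo, hi + 1):
        a, b = max(0, i - window), min(len(samples), i + window + 1)
        level = sum(abs(x) for x in samples[a:b]) / (b - a)
        key = (level, abs(i - center))
        if key < best_key:
            best, best_key = i, key
    return best


def nearest_zero_crossing(samples, index, lo, hi):
    """Nearest index in [lo, hi] where the signal is zero or changes sign; falls back to index."""
    for offset in range(max(index - lo, hi - index) + 1):
        for i in (index - offset, index + offset):
            if lo <= i <= hi and (
                samples[i] == 0 or (i > 0 and samples[i - 1] * samples[i] < 0)
            ):
                return i
    return index


def refine_endpoint(samples, sample_rate, t, lo_t, hi_t, max_slide=0.060, window_s=0.005):
    """Slide a cut time toward a quiet spot, then snap it to a zero crossing.

    lo_t and hi_t are the neighboring word boundaries the endpoint must not cross.
    """
    lo = max(round(lo_t * sample_rate), round((t - max_slide) * sample_rate), 0)
    hi = min(round(hi_t * sample_rate), round((t + max_slide) * sample_rate), len(samples) - 1)
    center = min(max(round(t * sample_rate), lo), hi)
    quiet = quietest_index(samples, center, lo, hi, round(window_s * sample_rate))
    return nearest_zero_crossing(samples, quiet, lo, hi) / sample_rate
```

Doing the slide first matters: a zero crossing in the middle of a loud vowel is still click-free, but cutting there removes audible speech, whereas a crossing inside a lull costs nothing.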

After all that, very short surviving fragments get cleaned...
