Denoising Voice Recordings On-Device


Duration blog, 26 June 2026

Here are a few recordings of speech in various noisy settings, alongside versions that have been denoised with an open-weight model that we ported to iOS and Android:

[Audio clips: Background music (Original, Cleaned)]

[Audio clips: A podcast in the background (Original, Cleaned)]

[Audio clips: A fan running (Original, Cleaned)]

[Audio clips: A clean recording, with pink noise added (Original, With noise, Cleaned)]

We think the music removal and speaker separation here are particularly impressive, especially for something that runs on phones. It even gets close to certain paid cloud services, which we didn’t expect. Comparisons against other denoising models follow below.

We shrank the original model to under half its size with little quality loss, and wrote our own ports for both Android and iOS for speed. On an M1 MacBook Pro it cleans a 10-second clip in around 0.4 seconds. On phones, iPhone hardware turns out to be much better suited to the job. A 2021 Fairphone 4 takes 4 to 5 seconds, and a 2025 Samsung Galaxy S25+ takes 1 second – but so does a 2020 iPhone 12 Pro.

Our flagship app (launched, but currently under wraps) brings frontier open-weight text-to-speech (TTS) models on-device. This makes zero-shot voice cloning possible entirely on the phone: the user records 5 to 10 seconds of their voice, and from that short reference clip the TTS model can immediately produce audio in the user's voice.

The catch is that zero-shot cloning is, understandably, very sensitive to the quality of the reference clip, so we set about finding the best way to clean it. We started with DeepFilterNet3, a tiny and capable denoiser, and compared it against everything else we could reasonably run. To find the ceiling for denoising, that included the ElevenLabs audio isolation API and the GPU-only SEMamba++ model (released earlier in 2026).

[Audio clips: The same podcast clip, through every model (Original, Ours, ElevenLabs, DeepFilterNet3, DPDFNet, SEMamba++)]

Of the clips above, the two that come through cleanest are ours and ElevenLabs, and to our surprise they are close. On speech, ElevenLabs is just a little cleaner. A model from a few years ago, shrunk to run on a phone, holds its own against a service running on proper servers – offline, for free, and without the recording ever leaving the device.

SEMamba++ struggles with overlapping speakers, though it’s otherwise a capable denoiser and good at reverb removal (see below).

There's a further drawback for our use case: SEMamba++ runs at 16 kHz. On a high-quality recording, the output ends up sounding a little muffled. This is because audio sampled at 16 kHz can’t represent frequencies above 8 kHz, which is half the sample rate. Things like breath and sibilance, the features that make a voice sound “real”, sit well above that, toward the 20 kHz edge of human hearing.

[Audio clips: A clean recording, through SEMamba++ (Original, SEMamba++)]
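The 8 kHz ceiling follows from the sampling theorem: a sample rate can only represent frequencies up to half its value. Here is a minimal Python sketch of that arithmetic, not anything taken from our pipeline. The function names are made up for illustration, and the 10 kHz test tone is an arbitrary stand-in for high-frequency content such as sibilance.

```python
import math


def nyquist_hz(sample_rate_hz):
    """Highest frequency that a given sample rate can represent."""
    return sample_rate_hz / 2


def alias_frequency_hz(freq_hz, sample_rate_hz):
    """Frequency a pure tone becomes indistinguishable from once sampled."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


def sample_cosine(freq_hz, sample_rate_hz, num_samples):
    """Sample a unit-amplitude cosine at the given rate."""
    return [
        math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(num_samples)
    ]


RATE_HZ = 16_000

# A 10 kHz tone sampled at 16 kHz yields the same samples as a 6 kHz tone,
# so the 10 kHz content cannot survive as 10 kHz content.
tone_10k = sample_cosine(10_000, RATE_HZ, 64)
tone_6k = sample_cosine(6_000, RATE_HZ, 64)
max_diff = max(abs(a - b) for a, b in zip(tone_10k, tone_6k))

print("Nyquist at 16 kHz:", nyquist_hz(16_000), "Hz")
print("Nyquist at 48 kHz:", nyquist_hz(48_000), "Hz")
print("10 kHz at 16 kHz aliases to:", alias_frequency_hz(10_000, RATE_HZ), "Hz")
print("Largest sample difference:", max_diff)
```

In practice a resampler low-pass filters the audio before reducing the rate, so content above 8 kHz is removed rather than folded back as an alias. Either way it is lost, which is why a 16 kHz model sounds muffled on a high-quality recording while a 48 kHz model, with a 24 kHz ceiling, does not.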

In the SEMamba++ output, everything above the 8 kHz line is gone, including the air and sibilance that are some of what makes a voice sound real.

For our purposes we don't want a denoiser that also degrades audio quality, so the model we settled on runs at 48 kHz.

It's odd that a model from a few years ago seems to be our best option, but it meets every criterion we care about: it's open-weight, we were able to port it to run fast on-device, it outputs at 48 kHz, and it removes background speakers and music better than any other open-weight model we tried. That last point matters more than it sounds, because a voice-recording-based feature has to work more or less anywhere: a kitchen with the radio on, a train, a room with a TV on in the background.

One thing our chosen model does not handle well is reverb, which gets left in with the voice. There are tools aimed only at de-reverberation, and both SEMamba++ and ElevenLabs remove it here.

[Audio clips: Reverb example (Original, Ours, ElevenLabs, SEMamba++)]

A lot of the work we did here was unglamorous – compressing the model and wringing enough speed out of it to run inside a mobile app (getting it fast on Android was particularly challenging). The payoff is a denoiser that cleans voice recordings right on the phone, wherever they were recorded, built on a model that has been around for a few years and is still hard to beat for what we need.
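One rough way to compare the timings quoted earlier is the real-time factor (RTF): processing time divided by audio duration, where anything below 1.0 is faster than real time. The sketch below is only illustrative arithmetic on the approximate figures from this post. It assumes the phone timings refer to the same 10-second clip as the MacBook figure, and the function name is made up for the example.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


# Assumption: every device was timed on the same 10-second clip.
CLIP_SECONDS = 10.0

# Approximate timings quoted in the post, in seconds.
timings_seconds = {
    "M1 MacBook Pro": 0.4,
    "Fairphone 4 (2021), best case": 4.0,
    "Fairphone 4 (2021), worst case": 5.0,
    "Samsung Galaxy S25+ (2025)": 1.0,
    "iPhone 12 Pro (2020)": 1.0,
}

for device, seconds in timings_seconds.items():
    rtf = real_time_factor(seconds, CLIP_SECONDS)
    print(f"{device}: RTF {rtf:.2f} ({1 / rtf:.1f}x faster than real time)")
```

Under that assumption, even the slowest phone listed cleans a clip in about half the time it took to record, which is well within what a 5 to 10 second reference recording needs.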
