Transcribing my old podcast locally with open-source AI


ma.ttias.be

Mattias Geniar

Back in 2016 and 2017 I recorded a podcast called Syscast: interviews with people I admired in the Linux, open source and infrastructure world. Matt Holt about Caddy, Daniel Stenberg about curl, Seth Vargo about Vault, and a handful more. Ten episodes, roughly ten hours of audio, and then life got in the way and I put it on pause.

The one thing those episodes never had was transcripts. I always wanted them. Audio is nice, but you can’t search it, you can’t skim it, and Google can’t read it. The problem was that in 2016, transcribing ten hours of two-person interviews yourself just wasn’t realistic. Decent speech-to-text was a paid cloud service, and telling two speakers apart was basically a research project.

It’s 2026 now, so I did it in an evening, on my own laptop, with open-source models and no API bill. Here’s how.

The stack

Two open-source pieces do the work:

WhisperX wraps OpenAI’s Whisper large-v3 model for the actual speech-to-text, with word-level timestamps.
pyannote.audio handles the speaker diarization.

“Diarization” was a new word to me when I started this. It’s the step that splits a recording up by speaker: this stretch is one voice, this stretch is another, without knowing who either of them is yet. Whisper writes down what gets said; pyannote works out who said it. Put the two together and a two-person interview reads as a real back-and-forth instead of one long undivided block.

Both run locally. The audio never leaves the machine and there’s no per-minute cost. The only thing you need from the outside world is a free Hugging Face account.

Gated models on Hugging Face

The diarization models are gated on Hugging Face. No idea why. Are they dangerous and you need to sign a waiver? Who knows.
🤷‍♂️

All I had to do was create a (free) account and a read token, and click “agree” on the model pages before I could download them. I hadn’t at first, and WhisperX greeted me with this:

Could not download 'pyannote/speaker-diarization-3.1' pipeline.
It might be because the pipeline is private or gated...

That reads like a network or auth bug, but it just means the licence wasn’t accepted yet. If you try this, accept the conditions on pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 first, drop your token in ~/.hf_token, and it works.

The pipeline

Setup is a virtualenv and one install (I used uv):

uv venv .venv-whisper
uv pip install --python .venv-whisper whisperx

The core is about a dozen lines of WhisperX. Load the model on CPU with int8 quantization (Apple Silicon has no usable CUDA path for this stack), transcribe, align for accurate word-level timestamps, then diarize and assign each word to a speaker:

import whisperx

audio = whisperx.load_audio(mp3_path)

# 1. transcribe with Whisper large-v3
model = whisperx.load_model("large-v3", "cpu", compute_type="int8", language="en")
result = model.transcribe(audio, batch_size=1, language="en")

# 2. align, for accurate word-level timestamps
align_model, meta = whisperx.load_align_model(language_code="en", device="cpu")
result = whisperx.align(result["segments"], align_model, meta, audio, "cpu")

# 3. diarize (who spoke when), then tag each word with a speaker
diarize = whisperx.DiarizationPipeline(use_auth_token=hf_token, device="cpu")
result = whisperx.assign_word_speakers(diarize(audio), result)
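To make step 3 a little less of a black box, here is a minimal sketch of how words can be tagged with speakers: each word goes to the speaker whose diarized turns overlap it the most in time. This is an illustration of the idea, not WhisperX's actual code; the function name `assign_speakers` and the dict shapes are assumptions made for the example.

```python
from collections import defaultdict


def assign_speakers(words, diarization):
    """Tag each word with the speaker whose turns overlap it the most.

    Both arguments are lists of dicts with "start" and "end" in seconds;
    diarization entries also carry a "speaker" label (hypothetical shape).
    """
    tagged = []
    for word in words:
        overlap_by_speaker = defaultdict(float)
        for turn in diarization:
            # Length of the time intersection between this word and this turn.
            overlap = min(word["end"], turn["end"]) - max(word["start"], turn["start"])
            if overlap > 0:
                overlap_by_speaker[turn["speaker"]] += overlap
        # Words that fall outside every turn stay unlabelled (None).
        speaker = max(overlap_by_speaker, key=overlap_by_speaker.get) if overlap_by_speaker else None
        tagged.append({**word, "speaker": speaker})
    return tagged


diarization = [
    {"start": 0.0, "end": 14.0, "speaker": "SPEAKER_00"},
    {"start": 14.0, "end": 20.0, "speaker": "SPEAKER_01"},
]
words = [
    {"word": "Welcome", "start": 0.2, "end": 0.6},
    {"word": "Hey", "start": 14.1, "end": 14.3},
]
for w in assign_speakers(words, diarization):
    print(w["word"], w["speaker"])
```

This is also why the alignment step matters: the tighter the word timestamps, the less often a word near a speaker change lands in the wrong turn.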

That gives me a list of segments, each with a speaker, start, end and text. Batching all ten episodes is just a loop over the mp3s, logging the wall-clock time per file (that’s where the benchmark below comes from):

for episode in static/podcast/episodes/*.mp3; do
  start=$(date +%s)
  .venv-whisper/bin/python scripts/transcribe-syscast.py "$episode"
  printf '%s\t%ss\n' "$(basename "$episode")" "$(( $(date +%s) - start ))"
done
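The loop hands each mp3 to the script as its first argument, and the WhisperX snippet earlier expects `mp3_path` and `hf_token` to exist already. A minimal sketch of how the top of such a script might set them up, assuming the token lives in ~/.hf_token as described above. The helper names `read_hf_token` and `episode_path` are hypothetical and not taken from the original script.

```python
import sys
from pathlib import Path


def read_hf_token(token_file=Path.home() / ".hf_token"):
    """Return the Hugging Face token stored in token_file, stripped of whitespace."""
    token = Path(token_file).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"{token_file} is empty")
    return token


def episode_path(argv):
    """Return the mp3 path given as the single command-line argument."""
    if len(argv) != 2:
        raise SystemExit(f"usage: {argv[0]} EPISODE.mp3")
    return Path(argv[1])


# Hypothetical usage at the top of the transcription script:
#   mp3_path = str(episode_path(sys.argv))
#   hf_token = read_hf_token()
```

Keeping the token in a file outside the repository means it never ends up in shell history or in a commit.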

Raw WhisperX output is choppy: lots of short segments, SPEAKER_00/SPEAKER_01 labels (just whoever talked first and second), no paragraphs:

[0:00] SPEAKER_00: Welcome to a new episode of Syscast. My name is Mattias Geniar and today I'm joined by Seth Vargo from HashiCorp.
[0:14] SPEAKER_01: Hey Mattias, I'm good. Doing well over here in Pittsburgh.

A small cleanup step merges consecutive segments from the same speaker into one turn, then splits long turns into paragraphs every few sentences:

turns = []
for seg in segments:
    if turns and turns[-1]["spk"] == seg["speaker"]:
        turns[-1]["text"] += " " + seg["text"].strip()  # same speaker, keep merging
    else:
        turns.append({"spk": seg["speaker"], "start": int(seg["start"]), "text": seg["text"].strip()})
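The merge loop above covers the first half of the cleanup. For the second half, splitting a long turn into paragraphs, a minimal sketch could look like the following. The function names `split_paragraphs` and `format_timestamp`, the naive regex sentence split, and the default of four sentences per paragraph are all assumptions for illustration, not the original script.

```python
import re


def split_paragraphs(text, sentences_per_paragraph=4):
    """Split one speaker turn into paragraphs of a few sentences each."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]


def format_timestamp(seconds):
    """Render a start time in seconds as m:ss (minutes are not wrapped into hours)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


turn = {
    "spk": "SPEAKER_01",
    "start": 14,
    "text": "Hey Mattias, I'm good. Doing well over here in Pittsburgh.",
}
for paragraph in split_paragraphs(turn["text"], sentences_per_paragraph=1):
    print(f"[{format_timestamp(turn['start'])}] {turn['spk']}: {paragraph}")
```

The regex will also split after abbreviations such as "Dr." or "e.g.", which is usually tolerable for conversational transcripts but worth knowing about.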

The last touch is mapping SPEAKER_00 to “Mattias” and SPEAKER_01 to the guest (whose name is right there in the episode title), and fixing the obvious mis-hearings. Whisper was very confident my name is “Matthias Genjar”. 😁

The...

