Transcribing my old podcast locally with open-source AI

ma.ttias.be

Mattias Geniar

Back in 2016 and 2017 I recorded a podcast called Syscast: interviews with people I admired in the Linux, open source and infrastructure world. Matt Holt about Caddy, Daniel Stenberg about curl, Seth Vargo about Vault, and a handful more. Ten episodes, roughly ten hours of audio, and then life got in the way and I put it on pause.

The one thing those episodes never had was transcripts. I always wanted them. Audio is nice, but you can’t search it, you can’t skim it, and Google can’t read it.

The problem was that in 2016, transcribing ten hours of two-person interviews yourself just wasn’t realistic. Decent speech-to-text was a paid cloud service, and telling two speakers apart was basically a research project.

It’s 2026 now, so I did it in an evening, on my own laptop, with open-source models and no API bill. Here’s how.

The stack

Two open-source pieces do the work:

- WhisperX wraps OpenAI’s Whisper large-v3 model for the actual speech-to-text, with word-level timestamps.
- pyannote.audio handles the speaker diarization.

“Diarization” was a new word to me when I started this. It’s the step that splits a recording up by speaker: this stretch is one voice, this stretch is another, without knowing who either of them is yet. Whisper writes down what gets said; pyannote works out who said it. Put the two together and a two-person interview reads as a real back-and-forth instead of one long undivided block.

Both run locally. The audio never leaves the machine and there’s no per-minute cost. The only thing you need from the outside world is a free Hugging Face account.

Gated models on Hugging Face

The diarization models are gated on Hugging Face. No idea why. Are they dangerous and you need to sign a waiver? Who knows. 🤷‍♂️

All I had to do was create a (free) account and a read token, and click “agree” on the model pages before I could download them. I hadn’t done that at first, and WhisperX greeted me with this:

    Could not download 'pyannote/speaker-diarization-3.1' pipeline.
    It might be because the pipeline is private or gated...

That reads like a network or auth bug, but it just means the licence wasn’t accepted yet. If you try this, accept the conditions on pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 first, drop your token in ~/.hf_token, and it works.

The pipeline

Setup is a virtualenv and one install (I used uv):

    uv venv .venv-whisper
    uv pip install --python .venv-whisper whisperx
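The pipeline code needs the Hugging Face token as a Python string (the `hf_token` variable). A minimal sketch of loading it from `~/.hf_token`, assuming the file holds nothing but the token itself; `read_hf_token` is a hypothetical helper name, not part of WhisperX or pyannote:

```python
from pathlib import Path


def read_hf_token(path: Path = Path.home() / ".hf_token") -> str:
    """Return the Hugging Face token stored in a plain-text file.

    Assumption: the file contains only the token, optionally followed
    by a newline.
    """
    if not path.is_file():
        raise FileNotFoundError(f"No token file at {path}")
    token = path.read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"Token file {path} is empty")
    return token
```

Failing early with a clear message here is easier to debug than the gated-pipeline error that shows up once the download has already been attempted.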

The core is about a dozen lines of WhisperX. Load the model on CPU with int8 quantization (Apple Silicon has no usable CUDA path for this stack), transcribe, align for accurate word-level timestamps, then diarize and assign each word to a speaker:

    import whisperx

    audio = whisperx.load_audio(mp3_path)

    # 1. transcribe with Whisper large-v3
    model = whisperx.load_model("large-v3", "cpu", compute_type="int8", language="en")
    result = model.transcribe(audio, batch_size=1, language="en")

    # 2. align, for accurate word-level timestamps
    align_model, meta = whisperx.load_align_model(language_code="en", device="cpu")
    result = whisperx.align(result["segments"], align_model, meta, audio, "cpu")

    # 3. diarize (who spoke when), then tag each word with a speaker
    diarize = whisperx.DiarizationPipeline(use_auth_token=hf_token, device="cpu")
    result = whisperx.assign_word_speakers(diarize(audio), result)

That gives me a list of segments, each with a speaker, start, end and text.

Batching all ten episodes is just a loop over the mp3s, logging the wall-clock time per file (that’s where the benchmark below comes from):

    for episode in static/podcast/episodes/*.mp3; do
      start=$(date +%s)
      .venv-whisper/bin/python scripts/transcribe-syscast.py "$episode"
      printf '%s\t%ss\n' "$(basename "$episode")" "$(( $(date +%s) - start ))"
    done

Raw WhisperX output is choppy: lots of short segments, SPEAKER_00/SPEAKER_01 labels (just whoever talked first and second), no paragraphs:

    [0:00] SPEAKER_00: Welcome to a new episode of Syscast. My name is Mattias Geniar and today I'm joined by Seth Vargo from HashiCorp.
    [0:14] SPEAKER_01: Hey Mattias, I'm good. Doing well over here in Pittsburgh.

A small cleanup step merges consecutive segments from the same speaker into one turn, then splits long turns into paragraphs every few sentences:

    turns = []
    for seg in segments:
        if turns and turns[-1]["spk"] == seg["speaker"]:
            # same speaker, keep merging
            turns[-1]["text"] += " " + seg["text"].strip()
        else:
            turns.append({"spk": seg["speaker"], "start": int(seg["start"]), "text": seg["text"].strip()})
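The merge covers the first half of that cleanup; the paragraph split is only described. One way it might look, as a sketch: sentences are split naively on whitespace after `.`, `!` or `?`, and the limit of four sentences per paragraph is an assumed value, not the one used for the real transcripts. `split_paragraphs` is a hypothetical helper name.

```python
import re


def split_paragraphs(text: str, max_sentences: int = 4) -> list[str]:
    """Group the sentences of one speaker turn into short paragraphs."""
    # Naive sentence boundary: whitespace that follows ., ! or ?
    # (abbreviations such as "e.g." will be split too early).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


# Example turn in the same shape as the merged turns (text is made up).
turns = [{"spk": "SPEAKER_00", "start": 0, "text": "One. Two. Three. Four. Five."}]
for turn in turns:
    turn["paragraphs"] = split_paragraphs(turn["text"], max_sentences=2)
```

Splitting on a fixed sentence count is crude, but for interview transcripts it is usually enough to break up a multi-minute answer into readable chunks.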

The last touch is mapping SPEAKER_00 to “Mattias” and SPEAKER_01 to the guest (whose name is right there in the episode title), and fixing the obvious mis-hearings. Whisper was very confident my name is “Matthias Genjar”. 😁 The...
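A sketch of what the naming and correction pass could look like, assuming the turns are dictionaries with `spk` and `text` keys as in the merge step. `label_turns` and the two lookup tables are hypothetical names, and the mapping assumes the host is the first voice in the recording, which has to be checked per episode:

```python
# Per-episode mapping; the guest name comes from the episode title.
SPEAKER_NAMES = {"SPEAKER_00": "Mattias", "SPEAKER_01": "Seth"}

# Known mis-hearings and their fixes (plain string replacement).
CORRECTIONS = {"Matthias Genjar": "Mattias Geniar"}


def label_turns(turns, names, corrections):
    """Return new turns with real speaker names and corrected text."""
    labelled = []
    for turn in turns:
        text = turn["text"]
        for wrong, right in corrections.items():
            text = text.replace(wrong, right)
        # Unknown labels are kept as-is so nothing is silently dropped.
        speaker = names.get(turn["spk"], turn["spk"])
        labelled.append({**turn, "spk": speaker, "text": text})
    return labelled


example = [{"spk": "SPEAKER_00", "start": 0, "text": "My name is Matthias Genjar."}]
cleaned = label_turns(example, SPEAKER_NAMES, CORRECTIONS)
```

Keeping the corrections in a table makes it cheap to rerun the pass over every episode whenever a new mis-hearing turns up.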
