Gemma 4 for Telephony: I Replaced Two AI Models With One in My Voice Phone Agent — Until I Switched to Chinese | by Jiyao Weng | Jun, 2026 | MediumSitemapOpen in appSign up<br>Sign in
Medium Logo
Get app<br>Write
Search
Sign up<br>Sign in
Gemma 4 for Telephony: I Replaced Two AI Models With One in My Voice Phone Agent — Until I Switched to Chinese
Jiyao Weng
9 min read·<br>19 hours ago
Listen
Share
Building a phone agent on a multimodal LLM: dropping faster-whisper and letting Gemma 4 hear the caller directly — a response-time and reply-accuracy benchmark across English, French, and Mandarin<br>Press enter or click to view image in full size
A telephony system using Gemma 4:12BMy voice phone agent uses two models: one to hear the caller, one to think. Gemma 4 can do both at once — so I tried deleting the speech-to-text model entirely. Across English, French, and Mandarin, here’s the head-to-head on response time and the thing that actually matters on a phone line: did it reply correctly. In English and French, one model beat two — faster and more accurate. In Chinese, it confidently told me a football field is 100 yards. Here’s the data.
The cascade, and the temptation to collapse it<br>My voice phone agent uses the same architecture as almost every spoken assistant: a cascade of specialist models.<br>caller audio ─▶ [ faster-whisper ] ─▶ text ─▶ [ Gemma LLM ] ─▶ reply ─▶ [ TTS ] ─▶ audio<br>speech-to-text reasoningTwo models loaded, two GPU residents, two sequential hops before the caller hears anything. Then Gemma 4 shipped in Ollama with an audio capability — it can take audio directly as input. Which raises the obvious question:<br>Why transcribe at all? If the LLM can hear the caller, it can understand and answer in one step. One model, one hop.
So I set up a clean head-to-head, holding the LLM constant so the only thing that changes is the input modality:<br>Pipeline A — the cascade: audio → faster-whisper → text → gemma4:12b → reply<br>Pipeline B — multimodal: audio → gemma4:12b (audio in) → reply<br>Same model doing the reasoning in both. I measured the two things that actually matter for a phone agent:<br>Reply correctness — did the agent answer the question correctly?<br>Response time — how long until the reply is ready.<br>A note on what to measure, because for a telephony system it changes the conclusion. The instinct is to score the speech-to-text by Word Error Rate. But a caller never hears the transcript — they hear the reply. What matters on a phone line is whether the agent did the right thing : answered the question, booked the table, cancelled the reservation. A transcript that drops a filler word or writes “7:00” for “seven o’clock” is harmless if the reply is still correct. So reply correctness is the primary metric here; transcription WER is only a diagnostic I use later to explain why the replies succeed or fail. Judge the system by the thing the caller actually experiences.<br>Methodology<br>I wrote 42 spoken caller turns (15 English, 15 French, 12 Mandarin) as natural speech via gTTS — factual questions, arithmetic, and task requests (bookings, cancellations) — each with a checkable answer key so correctness is objective, not vibes:<br>Spoken turnCorrect iff reply contains”What is the capital of France?”paris”What is fifteen plus twenty-seven?”42 / forty-two”I’d like to book a table for four at seven tonight.”(four and seven)”法国的首都是哪里?”巴黎”二乘以八等于几?”16 / 十六<br>Grading is regex on the reply, case-insensitive, with Traditional→Simplified normalization for Chinese (faster-whisper and Gemma both emit Traditional characters; the keys are Simplified). Both pipelines used identical generation settings (temperature=0, reasoning disabled). Hardware: one 24 GB GPU, Ollama 0.30.7, gemma4:12b (Q4_K_M), faster-whisper medium.<br>One setup gotcha worth saving you an hour: Ollama’s native /api/chat silently ignores the audios field — you get "there's no audio attached." Audio only works through the OpenAI-compatible /v1/chat/completions endpoint as an input_audio content part (base64, 16 kHz mono).
English: the one-model dream, realized<br>PipelineReply accuracyMedian latencyA — cascade (faster-whisper → gemma4)93 %0.81 sB — multimodal (gemma4 audio)100 %0.66 s<br>The multimodal model won on both axes. Every English question got a correct answer, and it did so faster than the cascade — because one model call beats two sequential ones (transcribe-then-reason). Side by side, they’re indistinguishable in quality:<br>"What is two times eight?"<br>A: Two times eight is 16. (0.79 s)<br>B: Two times eight is sixteen. (0.60 s)"How many legs does a spider have?"<br>A: A spider has eight legs. (0.76 s)<br>B: A spider has eight legs. (0.61 s)(The single cascade “miss” wasn’t an error — asked to book a table, it replied “what’s the name of the restaurant?”, a sensible clarifying turn that just didn’t contain the answer-key words. The multimodal model happened to confirm the booking outright.)<br>For an English voice agent, this is the result you hoped for: drop a whole model, get...