Gemma 4 for Telephony: From Two AI Models to One – Until I Switched to Chinese

Gemma 4 for Telephony: I Replaced Two AI Models With One in My Voice Phone Agent — Until I Switched to Chinese | by Jiyao Weng | Jun, 2026 | MediumSitemapOpen in appSign up Sign in

Medium Logo

Get app Write

Gemma 4 for Telephony: I Replaced Two AI Models With One in My Voice Phone Agent — Until I Switched to Chinese

Jiyao Weng

9 min read· 19 hours ago

Listen

Building a phone agent on a multimodal LLM: dropping faster-whisper and letting Gemma 4 hear the caller directly — a response-time and reply-accuracy benchmark across English, French, and Mandarin Press enter or click to view image in full size

A telephony system using Gemma 4:12BMy voice phone agent uses two models: one to hear the caller, one to think. Gemma 4 can do both at once — so I tried deleting the speech-to-text model entirely. Across English, French, and Mandarin, here’s the head-to-head on response time and the thing that actually matters on a phone line: did it reply correctly. In English and French, one model beat two — faster and more accurate. In Chinese, it confidently told me a football field is 100 yards. Here’s the data.

The cascade, and the temptation to collapse it My voice phone agent uses the same architecture as almost every spoken assistant: a cascade of specialist models. caller audio ─▶ [ faster-whisper ] ─▶ text ─▶ [ Gemma LLM ] ─▶ reply ─▶ [ TTS ] ─▶ audio speech-to-text reasoningTwo models loaded, two GPU residents, two sequential hops before the caller hears anything. Then Gemma 4 shipped in Ollama with an audio capability — it can take audio directly as input. Which raises the obvious question: Why transcribe at all? If the LLM can hear the caller, it can understand and answer in one step. One model, one hop.

So I set up a clean head-to-head, holding the LLM constant so the only thing that changes is the input modality: Pipeline A — the cascade: audio → faster-whisper → text → gemma4:12b → reply Pipeline B — multimodal: audio → gemma4:12b (audio in) → reply Same model doing the reasoning in both. I measured the two things that actually matter for a phone agent: Reply correctness — did the agent answer the question correctly? Response time — how long until the reply is ready. A note on what to measure, because for a telephony system it changes the conclusion. The instinct is to score the speech-to-text by Word Error Rate. But a caller never hears the transcript — they hear the reply. What matters on a phone line is whether the agent did the right thing : answered the question, booked the table, cancelled the reservation. A transcript that drops a filler word or writes “7:00” for “seven o’clock” is harmless if the reply is still correct. So reply correctness is the primary metric here; transcription WER is only a diagnostic I use later to explain why the replies succeed or fail. Judge the system by the thing the caller actually experiences. Methodology I wrote 42 spoken caller turns (15 English, 15 French, 12 Mandarin) as natural speech via gTTS — factual questions, arithmetic, and task requests (bookings, cancellations) — each with a checkable answer key so correctness is objective, not vibes: Spoken turnCorrect iff reply contains”What is the capital of France?”paris”What is fifteen plus twenty-seven?”42 / forty-two”I’d like to book a table for four at seven tonight.”(four and seven)”法国的首都是哪里？”巴黎”二乘以八等于几？”16 / 十六 Grading is regex on the reply, case-insensitive, with Traditional→Simplified normalization for Chinese (faster-whisper and Gemma both emit Traditional characters; the keys are Simplified). Both pipelines used identical generation settings (temperature=0, reasoning disabled). Hardware: one 24 GB GPU, Ollama 0.30.7, gemma4:12b (Q4_K_M), faster-whisper medium. One setup gotcha worth saving you an hour: Ollama’s native /api/chat silently ignores the audios field — you get "there's no audio attached." Audio only works through the OpenAI-compatible /v1/chat/completions endpoint as an input_audio content part (base64, 16 kHz mono).

English: the one-model dream, realized PipelineReply accuracyMedian latencyA — cascade (faster-whisper → gemma4)93 %0.81 sB — multimodal (gemma4 audio)100 %0.66 s The multimodal model won on both axes. Every English question got a correct answer, and it did so faster than the cascade — because one model call beats two sequential ones (transcribe-then-reason). Side by side, they’re indistinguishable in quality: "What is two times eight?" A: Two times eight is 16. (0.79 s) B: Two times eight is sixteen. (0.60 s)"How many legs does a spider have?" A: A spider has eight legs. (0.76 s) B: A spider has eight legs. (0.61 s)(The single cascade “miss” wasn’t an error — asked to book a table, it replied “what’s the name of the restaurant?”, a sensible clarifying turn that just didn’t contain the answer-key words. The multimodal model happened to confirm the booking outright.) For an English voice agent, this is the result you hoped for: drop a whole model, get...

Gemma 4 for Telephony: From Two AI Models to One – Until I Switched to Chinese

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

Apple WWDC 2026 Livestream

Claude Fable 5

US Government directive to suspend access to Fable 5 and Mythos 5

It's Not Just X. It's Y