Whissle Gateway – Run Multi-Modal Voice AI Locally in a 500MB Docker


🔒 Service Notice: Our cloud services are temporarily down; meanwhile we're reinforcing our on-prem AI. Contact: hello@whissle.ai

Run VoiceAI locally

ASR, TTS, voice calling, diarization, metadata, AI coaching: one Docker command. Models download automatically. No cloud dependency.

Quick start

    $ docker run -d --name whissle \
        -p 9000:9000 -p 8001:8001 -p 8003:8003 \
        -v whissle-models:/models -v whissle-data:/data \
        -e VARIANT=en-full \
        -e ANTHROPIC_API_KEY=your-key \
        whissleasr/whissle-gateway:latest

VARIANT options: hinglish, en-lite, en-full, multi-full, multi-zh, all
DEVICE options: cpu, cuda
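The quick-start command differs between variants only in its environment flags, so it can be generated. The following is a minimal Python sketch, not an official tool. The function name docker_run_args is hypothetical, and passing DEVICE as an environment variable (in the same way as VARIANT) is an assumption based on the selector shown on the page.

```python
import shlex

# Variant and device names as listed on the page.
VARIANTS = ("hinglish", "en-lite", "en-full", "multi-full", "multi-zh", "all")
DEVICES = ("cpu", "cuda")


def docker_run_args(variant="en-full", device=None, anthropic_key="your-key"):
    """Build the quick-start `docker run` argument list for one variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if device is not None and device not in DEVICES:
        raise ValueError(f"unknown device: {device!r}")

    args = [
        "docker", "run", "-d", "--name", "whissle",
        "-p", "9000:9000", "-p", "8001:8001", "-p", "8003:8003",
        "-v", "whissle-models:/models", "-v", "whissle-data:/data",
        "-e", f"VARIANT={variant}",
    ]
    if device is not None:
        # Assumption: DEVICE is read from the environment, like VARIANT.
        args += ["-e", f"DEVICE={device}"]
    args += [
        "-e", f"ANTHROPIC_API_KEY={anthropic_key}",
        "whissleasr/whissle-gateway:latest",
    ]
    return args


if __name__ == "__main__":
    # Print a shell-quoted command that can be pasted into a terminal.
    print(shlex.join(docker_run_args("hinglish", device="cpu")))
```

Returning a list rather than a string keeps the arguments safe to pass to subprocess.run without shell quoting issues.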

en-full · Downloads ~2 GB on first run (cached after)

What happens when you run it:

    ═══════════════════════════════════════════════
      Whissle Gateway — en-full
    ═══════════════════════════════════════════════
    No GPU detected → using CPU

    Shared models:
      ✓ speaker encoder + VAD        26 MB
      ✓ punctuation                  254 MB
      ✓ ITN (English + Hinglish)     1.5 MB

    Variant: en-full
      ✓ en-in-tech-misc (485 MB)
      ✓ KenLM ENGLISH (1.5 GB)

    Auth:
      Mode:   local
      Token:  wh_a1b2c3d4e5f6... (admin)
      Manage: curl -H 'Authorization: Bearer ...' localhost:9000/auth/tokens

    Starting services...
      PostgreSQL: :5432 ●
      ASR:        :8001 ●
      TTS:        :8003 ●
      Agent:      :8765 ●
      Pipecat:    :8000 ●
      Gateway:    :9000 ●
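Once the log reports the services as started, one way to confirm they are reachable from the host is a plain TCP connection check. This is a rough sketch using only the Python standard library. It checks the three ports the quick-start command publishes with -p (9000, 8001, 8003); the other ports in the log are not published by that command, so they are left out. The helper names are hypothetical.

```python
import socket

# Host ports published by the quick-start command (-p flags).
PUBLISHED_PORTS = {"Gateway": 9000, "ASR": 8001, "TTS": 8003}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def service_status(host="localhost", ports=PUBLISHED_PORTS):
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, up in service_status().items():
        print(f"{name:8s} {'up' if up else 'down'}")
```

An open port only shows that something is listening; it does not prove the models have finished downloading or loading.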

API

Five interfaces: batch REST, streaming WebSocket, text-to-speech, voice calling, and an intelligent agent.

Batch: POST localhost:8001/transcribe

    $ curl -X POST http://localhost:8001/transcribe \
        -F "file=@call.mp3" \
        -F "diarize=true" \
        -F "num_speakers=2" \
        -F "punctuation=true" \
        -F "metadata_prob=true" \
        -F "summarize=sales_coaching" \
        -o result.json

Response: transcript + metadata per segment + AI analysis

    {
      "segments": [
        {
          "speaker": "SPEAKER_00",
          "text": "Hello, good morning.",
          "start": 1.0,
          "end": 1.9,
          "metadata": {
            "emotion": "EMOTION_NEUTRAL",
            "behavior": "BEHAVIOR_DIRECT",
            "role": "ROLE_INTERVIEWER",
            "age": "AGE_30_45",
            "gender": "GENDER_MALE"
          },
          "words": [{"word": "Hello", "start": 1.0, "end": 1.3}]
        }
      ],
      "analysis": {
        "overall_score": 78,
        "buyer_outcome": "Converted",
        "practices": { "followed": 6, "total": 8 },
        "highlights": [...]
      }
    }
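To show how the response might be consumed, here is a small Python sketch that turns each segment into a readable line. It assumes the response is a JSON object with a "segments" list of objects, which is how the excerpt above reads; the embedded sample is trimmed from that excerpt, and format_segments is a hypothetical helper, not part of the gateway.

```python
import json

# Sample shaped like the response excerpt (one segment, trimmed fields).
SAMPLE = json.loads("""
{
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Hello, good morning.",
      "start": 1.0,
      "end": 1.9,
      "metadata": {"emotion": "EMOTION_NEUTRAL", "role": "ROLE_INTERVIEWER"}
    }
  ],
  "analysis": {"overall_score": 78, "practices": {"followed": 6, "total": 8}}
}
""")


def format_segments(result):
    """Render each segment as '[start-end] SPEAKER (emotion): text'."""
    lines = []
    for seg in result.get("segments", []):
        # Metadata may be absent on some segments, so fall back gracefully.
        emotion = seg.get("metadata", {}).get("emotion", "unknown")
        lines.append(
            f"[{seg['start']:.1f}-{seg['end']:.1f}] "
            f"{seg['speaker']} ({emotion}): {seg['text']}"
        )
    return lines


if __name__ == "__main__":
    # With a real run, load the file written by curl's -o flag instead:
    #   with open("result.json") as fh: result = json.load(fh)
    for line in format_segments(SAMPLE):
        print(line)
```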

Parameters

All parameters for POST /transcribe.

    Parameter        Type    Default   Description
    file             file    required  Audio file (MP3, WAV, FLAC, OGG, M4A)
    language         string  auto      Language hint: en, hi, zh
    diarize          bool    false     Speaker diarization
    num_speakers     int     auto      Exact speaker count (if known)
    punctuation      bool    true      Restore punctuation and capitalization
    itn              bool    true      Inverse text normalization (numbers, currency)
    use_lm           bool    true      KenLM language model beam search
    metadata_prob    bool    false     Probability distributions for metadata
    word_timestamps  bool    false     Per-word start/end timestamps
    speech_analysis  bool    false     Speech patterns (pace, fillers, fluency)
    summarize        string  -         AI analysis: true, sales_coaching, collections, or custom prompt
    hotwords         string  -         Comma-separated hotwords for boosting
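The table above can double as a client-side check before a request is sent. The sketch below is illustrative Python: it validates parameter names and types against the table and encodes booleans as lowercase true/false, matching the curl example. The helpers form_fields and curl_flags are hypothetical, and the gateway's own validation may differ.

```python
# Parameter names and types taken from the table (file is handled separately).
PARAM_TYPES = {
    "language": str, "diarize": bool, "num_speakers": int,
    "punctuation": bool, "itn": bool, "use_lm": bool,
    "metadata_prob": bool, "word_timestamps": bool,
    "speech_analysis": bool, "summarize": str, "hotwords": str,
}


def form_fields(**params):
    """Validate parameters and encode them as multipart form strings."""
    fields = {}
    for name, value in params.items():
        expected = PARAM_TYPES.get(name)
        if expected is None:
            raise ValueError(f"unknown parameter: {name}")
        # bool is a subclass of int, so reject True/False for int parameters.
        wrong_bool = expected is int and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected):
            raise TypeError(f"{name} expects {expected.__name__}")
        if expected is bool:
            fields[name] = "true" if value else "false"
        else:
            fields[name] = str(value)
    return fields


def curl_flags(audio_path, **params):
    """Return the -F flags for curl, starting with the audio file."""
    flags = ["-F", f"file=@{audio_path}"]
    for name, value in form_fields(**params).items():
        flags += ["-F", f"{name}={value}"]
    return flags


if __name__ == "__main__":
    print(curl_flags("call.mp3", diarize=True, num_speakers=2))
```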

AI analysis modes

Add -F "summarize=mode" to any transcription. The diarized transcript and metadata are sent to Claude or Gemini for analysis.

sales_coaching (Sales Coaching): 8 best practices scored. Rep/buyer identification. Highlights with timestamps. Behavior labels per segment. Overall score 0–100.

collections (Collections Compliance): Identity verification, reason stated, amount mentioned, no harassment. Call outcome (Promise to Pay / Dispute / Hardship). Next action.

true (General Summary): Overview, participants, key topics, emotional dynamics, entities, outcome. Markdown format.

your prompt here (Custom Prompt): Pass any prompt string. The LLM receives your instructions + full transcript with per-segment metadata.
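As a client-side illustration of the four modes, the Python sketch below maps a summarize value to the mode it would select and derives a percentage from the "practices" counts shown in the sample response. It assumes an exact match on the three built-in values and treats anything else as a custom prompt; how the gateway itself matches values is not documented here, and both function names are hypothetical.

```python
# Built-in summarize values and their titles, as listed above.
BUILT_IN_MODES = {
    "true": "General Summary",
    "sales_coaching": "Sales Coaching",
    "collections": "Collections Compliance",
}


def summarize_mode(value):
    """Return the mode title for a summarize value, else 'Custom Prompt'."""
    # Assumption: any string that is not a built-in value is a custom prompt.
    return BUILT_IN_MODES.get(value, "Custom Prompt")


def practices_followed_pct(analysis):
    """Percentage of best practices followed, from the analysis object."""
    practices = analysis["practices"]
    return round(100 * practices["followed"] / practices["total"])


if __name__ == "__main__":
    print(summarize_mode("sales_coaching"))
    print(summarize_mode("List every action item with its owner."))
    print(practices_followed_pct({"practices": {"followed": 6, "total": 8}}))
```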

Models

Each model extracts different metadata in a single ASR forward pass, with no separate models or API calls.

en-in-tech-misc (485 MB)
Tags: BEHAVIOR, EMOTION, EVAL, ROLE, AGE, GENDER, ENTITY
120M params. 26 behavioral codes for coaching, therapy, interviews. 8 evaluation labels.
English · 6 heads, 51 classes

hinglish-loans (479 MB)
Tags: INTENT, EMOTION, ROLE, AGE, GENDER, ENTITY
115M params. Debt collection intents: pay-back, disputes, hardship. Agent/Customer role detection.
Hindi-English · 5 heads, 26 classes

zh (627 MB)
Tags: DIALECT, AGE, GENDER, ENTITY
160M params. Mandarin with North/South dialect detection.
Mandarin · 3 heads, 12 classes

whissle-large (2.4 GB)
Tags: INTENT, EMOTION, AGE, GENDER, ENTITY
600M params, inline action tokens. 31 intent groups, 18K vocabulary.
23 languages · 5,500+ action tokens

Kokoro TTS (82 MB)
55 voices
Non-autoregressive text-to-speech. Sub-200ms TTFB on CPU. Always included.
10 languages · Baked in

Punctuation + ITN (255 MB)
Tags: Capitalization, Numbers
Punctuation restoration and inverse text normalization.
EN + Hinglish · Auto-downloaded
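Because each ASR model emits a different set of tags, picking a model often starts from the metadata you need. The Python sketch below encodes the four ASR model cards as data and looks models up by tag. The tag lists are a reading of the run-together labels on the page, the 2.4 GB size is approximated as 2400 MB, and the AsrModel class and models_with_tag function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsrModel:
    name: str
    size_mb: int   # approximate download size
    tags: tuple    # metadata tags the model emits


# Values transcribed from the model cards above.
MODELS = (
    AsrModel("en-in-tech-misc", 485,
             ("BEHAVIOR", "EMOTION", "EVAL", "ROLE", "AGE", "GENDER", "ENTITY")),
    AsrModel("hinglish-loans", 479,
             ("INTENT", "EMOTION", "ROLE", "AGE", "GENDER", "ENTITY")),
    AsrModel("zh", 627, ("DIALECT", "AGE", "GENDER", "ENTITY")),
    AsrModel("whissle-large", 2400,
             ("INTENT", "EMOTION", "AGE", "GENDER", "ENTITY")),
)


def models_with_tag(tag):
    """Return the names of models that emit the given tag (case-insensitive)."""
    wanted = tag.upper()
    return [m.name for m in MODELS if wanted in m.tags]


if __name__ == "__main__":
    for tag in ("EMOTION", "DIALECT", "INTENT"):
        print(tag, "->", ", ".join(models_with_tag(tag)))
```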

Metadata per segment

Every segment includes these tags. Common tags appear on all...
