Whissle Gateway – Run Multi-Modal Voice AI Locally in a 500MB Docker

ksingla0251 pts0 comments


🔒 Service Notice: Our cloud services are temporarily down; meanwhile we're reinforcing our on-prem AI. Contact: hello@whissle.ai


Run VoiceAI locally
ASR, TTS, voice calling, diarization, metadata, AI coaching: one Docker command. Models download automatically. No cloud dependency.

Quick start

    $ docker run -d --name whissle \
        -p 9000:9000 -p 8001:8001 -p 8003:8003 \
        -v whissle-models:/models -v whissle-data:/data \
        -e VARIANT=en-full \
        -e ANTHROPIC_API_KEY=your-key \
        whissleasr/whissle-gateway:latest

VARIANT options: hinglish, en-lite, en-full, multi-full, multi-zh, all
DEVICE options: cpu, cuda

en-full · Downloads ~2 GB on first run (cached after)

What happens when you run it:

    ═══════════════════════════════════════════════
    Whissle Gateway — en-full
    ═══════════════════════════════════════════════
    No GPU detected → using CPU

    Shared models:
      ✓ speaker encoder + VAD       26 MB
      ✓ punctuation                 254 MB
      ✓ ITN (English + Hinglish)    1.5 MB

    Variant: en-full
      ✓ en-in-tech-misc (485 MB)
      ✓ KenLM ENGLISH (1.5 GB)

    Auth:
      Mode: local
      Token: wh_a1b2c3d4e5f6... (admin)
      Manage: curl -H 'Authorization: Bearer ...' localhost:9000/auth/tokens

    Starting services...
      PostgreSQL: :5432 ●
      ASR:        :8001 ●
      TTS:        :8003 ●
      Agent:      :8765 ●
      Pipecat:    :8000 ●
      Gateway:    :9000 ●
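Once the container is up, it can be handy to confirm from the host that the services are accepting connections. The sketch below is a minimal TCP check and nothing more: it assumes the three ports published by the quick-start command (9000, 8001, 8003) and does not call any Whissle endpoint. The names `PUBLISHED_PORTS`, `is_port_open`, and `check_services` are illustrative, not part of the gateway.

```python
import socket

# Ports published by the quick-start `docker run` command (-p flags).
# PostgreSQL, Agent, and Pipecat appear in the startup log but are not
# published there, so this host-side check leaves them out.
PUBLISHED_PORTS = {"Gateway": 9000, "ASR": 8001, "TTS": 8003}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host: str = "127.0.0.1", ports=None) -> dict:
    """Map each service name to whether its port accepts connections."""
    ports = PUBLISHED_PORTS if ports is None else ports
    return {name: is_port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'down'}")
```

An open port only shows that something is listening; on first run the model download may still be in progress, so a request can fail even after the port opens.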

API
Five interfaces: batch REST, streaming WebSocket, text-to-speech, voice calling, and an intelligent agent.

Batch: POST localhost:8001/transcribe

    $ curl -X POST http://localhost:8001/transcribe \
        -F "file=@call.mp3" \
        -F "diarize=true" \
        -F "num_speakers=2" \
        -F "punctuation=true" \
        -F "metadata_prob=true" \
        -F "summarize=sales_coaching" \
        -o result.json

Response: transcript, per-segment metadata, and AI analysis

    {
      "segments": [
        {
          "speaker": "SPEAKER_00",
          "text": "Hello, good morning.",
          "start": 1.0, "end": 1.9,
          "metadata": {
            "emotion": "EMOTION_NEUTRAL",
            "behavior": "BEHAVIOR_DIRECT",
            "role": "ROLE_INTERVIEWER",
            "age": "AGE_30_45",
            "gender": "GENDER_MALE"
          },
          "words": [{"word": "Hello", "start": 1.0, "end": 1.3}]
        }
      ],
      "analysis": {
        "overall_score": 78,
        "buyer_outcome": "Converted",
        "practices": { "followed": 6, "total": 8 },
        "highlights": [...]
      }
    }
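To show how the batch response might be consumed, here is a small sketch that computes talk time per speaker and prints a readable transcript. It assumes the response has exactly the shape of the sample above (`segments` with `speaker`, `text`, `start`, `end`, `metadata`); the helper names are hypothetical, and `SAMPLE` simply copies the fields shown on this page.

```python
import json
from collections import defaultdict

# Sample copied from the response excerpt above (one segment only).
SAMPLE = {
    "segments": [
        {
            "speaker": "SPEAKER_00",
            "text": "Hello, good morning.",
            "start": 1.0,
            "end": 1.9,
            "metadata": {"emotion": "EMOTION_NEUTRAL"},
        }
    ]
}


def talk_time_by_speaker(result: dict) -> dict:
    """Sum segment durations (seconds) per speaker, rounded to 3 decimals."""
    totals = defaultdict(float)
    for seg in result.get("segments", []):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {speaker: round(total, 3) for speaker, total in totals.items()}


def format_transcript(result: dict) -> str:
    """Render one line per segment: time range, speaker, emotion, text."""
    lines = []
    for seg in result.get("segments", []):
        emotion = seg.get("metadata", {}).get("emotion", "?")
        lines.append(
            f"[{seg['start']:.1f}-{seg['end']:.1f}] "
            f"{seg['speaker']} ({emotion}): {seg['text']}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # For real output, replace SAMPLE with json.load(open("result.json")).
    print(json.dumps(talk_time_by_speaker(SAMPLE)))
    print(format_transcript(SAMPLE))
```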

Parameters
All parameters for POST /transcribe.

    Parameter        Type    Default   Description
    file             file    required  Audio file (MP3, WAV, FLAC, OGG, M4A)
    language         string  auto      Language hint: en, hi, zh
    diarize          bool    false     Speaker diarization
    num_speakers     int     auto      Exact speaker count (if known)
    punctuation      bool    true      Restore punctuation and capitalization
    itn              bool    true      Inverse text normalization (numbers, currency)
    use_lm           bool    true      KenLM language model beam search
    metadata_prob    bool    false     Probability distributions for metadata
    word_timestamps  bool    false     Per-word start/end timestamps
    speech_analysis  bool    false     Speech patterns (pace, fillers, fluency)
    summarize        string  -         AI analysis: true, sales_coaching, collections, or custom prompt
    hotwords         string  -         Comma-separated hotwords for boosting
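The parameter list above maps naturally onto multipart form fields. The following sketch builds those fields from Python values and rejects names that are not in the list. It is only an illustration: `build_form_fields` is a hypothetical helper, and encoding booleans as the strings `true` and `false` is an assumption based on the curl example (`diarize=true`), not a documented contract.

```python
# Parameter names grouped by the types given in the parameter list.
BOOL_PARAMS = {
    "diarize", "punctuation", "itn", "use_lm",
    "metadata_prob", "word_timestamps", "speech_analysis",
}
INT_PARAMS = {"num_speakers"}
STRING_PARAMS = {"language", "summarize", "hotwords"}


def build_form_fields(**options) -> dict:
    """Convert keyword options into string form fields for POST /transcribe.

    Options set to None are omitted so the server applies its own default.
    """
    fields = {}
    for name, value in options.items():
        if value is None:
            continue
        if name in BOOL_PARAMS:
            if not isinstance(value, bool):
                raise TypeError(f"{name} must be a bool")
            # Assumption: lowercase true/false, as in the curl example.
            fields[name] = "true" if value else "false"
        elif name in INT_PARAMS:
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be a positive integer")
            fields[name] = str(value)
        elif name in STRING_PARAMS:
            if name == "hotwords" and not isinstance(value, str):
                value = ",".join(value)  # comma-separated, per the list
            fields[name] = str(value)
        else:
            raise KeyError(f"unknown parameter: {name}")
    return fields


if __name__ == "__main__":
    print(build_form_fields(diarize=True, num_speakers=2,
                            summarize="sales_coaching"))
```

The resulting dictionary can be sent together with the audio file by any multipart-capable HTTP client, for example as the `data` argument next to `files={"file": ...}` in the third-party `requests` library.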

AI analysis modes
Add -F "summarize=mode" to any transcription. The diarized transcript and its metadata are sent to Claude or Gemini for analysis.

sales_coaching (Sales Coaching)
8 best practices scored. Rep/buyer identification. Highlights with timestamps. Behavior labels per segment. Overall score 0–100.

collections (Collections Compliance)
Identity verification, reason stated, amount mentioned, no harassment. Call outcome (Promise to Pay / Dispute / Hardship). Next action.

true (General Summary)
Overview, participants, key topics, emotional dynamics, entities, outcome. Markdown format.

your prompt here (Custom Prompt)
Pass any prompt string. The LLM receives your instructions plus the full transcript with per-segment metadata.
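As an illustration of reading the sales_coaching output, the sketch below turns the `analysis` block from the sample response into a one-line summary. It assumes only the keys visible in that sample (`overall_score`, `buyer_outcome`, `practices.followed`, `practices.total`); other modes may return different keys, and `summarize_analysis` is a hypothetical helper name.

```python
def summarize_analysis(analysis: dict) -> str:
    """Build a short, human-readable summary from an analysis block.

    Missing keys are skipped rather than treated as errors, since the
    exact schema for each mode is not specified here.
    """
    parts = []

    score = analysis.get("overall_score")
    if score is not None:
        parts.append(f"score {score}/100")

    practices = analysis.get("practices", {})
    followed = practices.get("followed")
    total = practices.get("total")
    if followed is not None and total:
        # Share of best practices followed, as a whole percentage.
        parts.append(f"practices {followed}/{total} ({followed / total:.0%})")

    if "buyer_outcome" in analysis:
        parts.append(f"outcome: {analysis['buyer_outcome']}")

    return ", ".join(parts)


if __name__ == "__main__":
    # Values copied from the sample response on this page.
    sample = {
        "overall_score": 78,
        "buyer_outcome": "Converted",
        "practices": {"followed": 6, "total": 8},
    }
    print(summarize_analysis(sample))
```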

Models
Each model extracts different metadata in a single ASR forward pass, with no separate models or API calls.

en-in-tech-misc (485 MB)
Tags: BEHAVIOR, EMOTION, EVAL, ROLE, AGE, GENDER, ENTITY
120M params. 26 behavioral codes for coaching, therapy, interviews. 8 evaluation labels.
English · 6 heads, 51 classes

hinglish-loans (479 MB)
Tags: INTENT, EMOTION, ROLE, AGE, GENDER, ENTITY
115M params. Debt collection intents: pay-back, disputes, hardship. Agent/Customer role detection.
Hindi-English · 5 heads, 26 classes

zh (627 MB)
Tags: DIALECT, AGE, GENDER, ENTITY
160M params. Mandarin with North/South dialect detection.
Mandarin · 3 heads, 12 classes

whissle-large (2.4 GB)
Tags: INTENT, EMOTION, AGE, GENDER, ENTITY
600M params, inline action tokens. 31 intent groups, 18K vocabulary.
23 languages · 5,500+ action tokens

Kokoro TTS (82 MB)
55 voices
Non-autoregressive text-to-speech. Sub-200ms TTFB on CPU. Always included.
10 languages · Baked in

Punctuation + ITN (255 MB)
Tags: Capitalization, Numbers
Punctuation restoration and inverse text normalization.
EN + Hinglish · Auto-downloaded

Metadata per segment
Every segment includes these tags. Common tags appear on all...
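Because the tags arrive per segment, a common next step is to aggregate them per speaker. The sketch below counts the values of one tag for each speaker. It assumes the segment shape from the sample response; `tally_metadata` is a hypothetical helper, and the second speaker in the demo data is invented for illustration.

```python
from collections import Counter


def tally_metadata(segments: list, tag: str) -> dict:
    """Count the values of one metadata tag for each speaker.

    Segments without the tag are ignored.
    """
    counts = {}
    for seg in segments:
        value = seg.get("metadata", {}).get(tag)
        if value is None:
            continue
        counts.setdefault(seg["speaker"], Counter())[value] += 1
    return counts


if __name__ == "__main__":
    # Illustrative data: only EMOTION_NEUTRAL appears in the sample
    # response, and SPEAKER_01 is an assumed second speaker label.
    demo = [
        {"speaker": "SPEAKER_00", "metadata": {"emotion": "EMOTION_NEUTRAL"}},
        {"speaker": "SPEAKER_01", "metadata": {"emotion": "EMOTION_NEUTRAL"}},
        {"speaker": "SPEAKER_00", "metadata": {"emotion": "EMOTION_NEUTRAL"}},
    ]
    for speaker, counter in tally_metadata(demo, "emotion").items():
        print(speaker, dict(counter))
```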

