Reachy Mini bot goes local

Reachy Mini goes fully local


Published May 27, 2026

Amir Mahla (A-Mahla)

Andres Marafioti (andito)

After building your Reachy Mini, you'll install the conversation app and start talking to it. Until now, you had to send your audio to a server. But not anymore. Today we'll walk you through running the whole stack locally.

This stack is powered by speech-to-speech, our cascaded VAD → STT → LLM → TTS pipeline that exposes a Realtime API-compatible /v1/realtime WebSocket. Once you launch the backend, point the robot at it from the UI.

Cascades are the most flexible option in the open-source landscape today, and with the right pieces they're also the fastest. We'll recommend the components we like best, but the whole point of a cascade is that you can swap them. New models drop every week.

TL;DR

Deploy a local speech backend for your Reachy Mini.

We use our speech-to-speech library, a cascade approach.

Recommended: llama.cpp with Gemma 4, Silero VAD, Parakeet-TDT 0.6B v3 STT, Qwen3-TTS.

Quick start

This blog walks you through running conversations with Reachy Mini fully locally. No cloud, no API keys, no data leaving your machine. Here's a video showing this live:

Locally serving the LLM

To serve the LLM, we'll use Hugging Face's llama.cpp. If you need to install it, the simplest way is brew install llama.cpp or winget install llama.cpp; for more help, check the docs. Then run:

llama-server -hf ggml-org/gemma-4-E4B-it-GGUF -np 2 -c 65536 -fa on --swa-full

And done! The first launch downloads the model; subsequent launches are fast.

What do those flags do?

-hf ggml-org/gemma-4-E4B-it-GGUF — pulls the model straight from the Hub. First run downloads it, subsequent runs use the cache.

-np 2 — two parallel slots. Lets the server handle a second request (e.g. a quick interruption) without blocking on the first.

-c 65536 — 64k context window, shared across slots. Plenty of headroom for long conversations.

-fa on — flash attention. Faster and lower memory, basically free on modern hardware.

--swa-full — keeps the full sliding-window attention cache instead of recomputing it. Trades a bit of RAM for noticeably faster prompt processing on Gemma.
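Before moving on, it can help to confirm that the server has finished loading the model. The sketch below polls llama-server's /health endpoint on its default address (the same http://127.0.0.1:8080 used in the next section). The function names, the retry count, and the delay are illustrative choices, not part of llama.cpp or speech-to-speech.

```python
import time
import urllib.error
import urllib.request

# Default llama-server address; adjust if you passed --host or --port.
LLAMA_URL = "http://127.0.0.1:8080"


def server_is_ready(base_url=LLAMA_URL, opener=urllib.request.urlopen, timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200.

    While the model is still loading, llama-server answers with an error
    status, which urlopen raises as HTTPError (a subclass of URLError).
    """
    try:
        with opener(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(base_url=LLAMA_URL, attempts=30, delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Poll the health endpoint until it succeeds or attempts run out."""
    for _ in range(attempts):
        if server_is_ready(base_url, opener=opener):
            return True
        sleep(delay)
    return False
```

Calling wait_until_ready() right after starting llama-server returns True once the model is loaded, or False if the server never comes up within the allotted attempts. The opener and sleep parameters exist only so the helpers can be exercised without a live server.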

Setting up speech-to-speech

We'll begin by installing the library:

uv pip install speech-to-speech

Then, with the LLM server still running in another terminal, run:

speech-to-speech --responses_api_base_url "http://127.0.0.1:8080" --responses_api_api_key "" --mode local

And you can start talking to the model through your terminal! The first launch needs to download Parakeet-TDT 0.6B v3 and Qwen3-TTS, but subsequent launches are fast.

Here's a video showing the local conversation mode:

Once you've tried it in --mode local, run the same command again without that option to serve speech-to-speech to the robot.
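If you'd rather not juggle two terminals, the two commands can be assembled and launched from one script. This is a minimal sketch that uses only the flags shown in this post; the helper names and the launch_stack function are hypothetical conveniences, and the script does not wait for llama-server to finish loading before starting speech-to-speech.

```python
import shlex
import subprocess

LLAMA_URL = "http://127.0.0.1:8080"


def llama_server_cmd(model="ggml-org/gemma-4-E4B-it-GGUF", slots=2, ctx=65536):
    """Build the llama-server command from the Quick start section."""
    return ["llama-server", "-hf", model, "-np", str(slots),
            "-c", str(ctx), "-fa", "on", "--swa-full"]


def speech_to_speech_cmd(llm_url=LLAMA_URL, api_key="", local=False):
    """Build the speech-to-speech command.

    local=True adds --mode local (talk through the terminal);
    local=False omits it, which serves the pipeline to the robot.
    """
    cmd = ["speech-to-speech",
           "--responses_api_base_url", llm_url,
           "--responses_api_api_key", api_key]
    if local:
        cmd += ["--mode", "local"]
    return cmd


def launch_stack(local=False):
    """Start llama-server in the background, then run speech-to-speech."""
    llm = subprocess.Popen(llama_server_cmd())
    try:
        # Blocks until speech-to-speech exits (for example on Ctrl+C).
        subprocess.run(speech_to_speech_cmd(local=local), check=False)
    finally:
        llm.terminate()
        llm.wait()


# Print the exact commands so they can be checked before launching.
print(shlex.join(llama_server_cmd()))
print(shlex.join(speech_to_speech_cmd(local=True)))
```

Calling launch_stack(local=True) reproduces the terminal conversation mode, and launch_stack() serves the robot.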

Connecting Reachy Mini to speech-to-speech

Once you have llama.cpp and speech-to-speech running, start the robot with the desktop app and launch the conversation app. In the conversation app's UI, choose the local mode by clicking "edit connection" in the HF backend. Here's a video showing how to do it:

And you're done. You can start talking to your robot. Every stage of the pipeline is a trade-off: there are faster TTS models with lower quality, and slower STT models with higher quality. We optimized for multilingual use; you might want to optimize for a single language. The rest of the blog covers how to customize.

Going deeper

Why run your own Speech-to-Speech server?

Hosted realtime backends are convenient, but running your own engine unlocks three things:

Privacy. Audio never leaves your network: the entire pipeline runs on hardware you control.

No API costs. No per-minute or per-token fees.

Full control over the pipeline. Swap any piece (VAD, STT, LLM, TTS) whenever something better lands on the Hub 🤗.

The speech-to-speech repo gives you all of that in a single CLI. It boots a WebSocket server at /v1/realtime that speaks the same protocol Reachy Mini already knows how to talk to.
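To make "speaks the same protocol" a bit more concrete, here is a rough sketch of what a client sends over that WebSocket, assuming the server accepts the JSON events defined by OpenAI's Realtime API. The input_audio_buffer.append event comes from that API; whether speech-to-speech handles this particular event is an assumption, and the port 8765 in the usage note is a placeholder, since this post does not state which port the server listens on.

```python
import base64
import json


def realtime_url(host, port, secure=False):
    """Build the WebSocket URL for the /v1/realtime endpoint."""
    scheme = "wss" if secure else "ws"
    return f"{scheme}://{host}:{port}/v1/realtime"


def audio_append_event(pcm16: bytes) -> str:
    """Wrap a chunk of raw PCM16 audio in a Realtime-style client event.

    The Realtime API carries audio as base64 text inside a JSON message.
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16).decode("ascii"),
    })
```

For example, realtime_url("127.0.0.1", 8765) yields ws://127.0.0.1:8765/v1/realtime, and each string returned by audio_append_event would be sent as one text frame by whichever WebSocket client you use.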

Our opinionated defaults: VAD, STT, TTS

A cascaded voice pipeline has four stages: VAD, STT, LLM, and TTS. For three of them, we pick solid defaults so you can focus on the LLM:

| Stage | Choice | Why |
| --- | --- | --- |
| VAD | Silero VAD v5 | Tiny, accurate, runs on CPU. The de-facto default in the open-source voice-agent world. |
| STT | Parakeet-TDT 0.6B v3 | Streaming-friendly, very fast, great quality on English. |
| TTS | Qwen3-TTS | Expressive, low-latency, multilingual, supports custom voices. |

We are opinionated about these choices, but feel free to swap them out if you have a preference.
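The swap-any-stage idea can be illustrated with a toy cascade in which each stage is just a callable. This is a conceptual sketch only, not the speech-to-speech library's API: the Cascade class and the stub stages are hypothetical, and the real pipeline streams audio rather than handling one buffer at a time.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Cascade:
    vad: Callable[[bytes], bool]   # does this audio contain speech?
    stt: Callable[[bytes], str]    # audio -> transcript
    llm: Callable[[str], str]      # transcript -> reply text
    tts: Callable[[str], bytes]    # reply text -> audio

    def respond(self, audio: bytes) -> Optional[bytes]:
        """Run one turn through VAD -> STT -> LLM -> TTS."""
        if not self.vad(audio):
            return None  # no speech detected, nothing to answer
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stub stages standing in for real models: "audio" is UTF-8 text here.
demo = Cascade(
    vad=lambda audio: len(audio) > 0,
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: text.upper(),
    tts=lambda text: text.encode("utf-8"),
)

# Swapping one stage leaves the other three untouched.
reversed_llm = replace(demo, llm=lambda text: text[::-1])
```

Because every stage only agrees on its input and output types, replacing the LLM (or any other stage) is a one-line change.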

Choosing your LLM

The LLM is the layer with the most impact on latency and overall performance of the system. We support two options: run a model locally (llama.cpp, MLX, Transformers, vLLM), or use a server with a...
