Introducing LlamaStash: a zero-overhead, terminal-native llama.cpp launcher | Technorage
Series · GNU/Linux Environment for Developers
Post 8 of 8
My beautiful Linux development environment<br>Must have GNOME extensions<br>Configure a beautiful terminal on Unix with Zsh<br>My VS Code setup - Making the most out of VS Code<br>The state of Linux as a daily use OS in 2021<br>My sleek and modern Linux development machine in 2021<br>My fully offline AI-assisted Linux development machine<br>Introducing LlamaStash: a zero-overhead, terminal-native llama.cpp launcher<br>In my recent post about my fully offline AI-assisted Linux development machine, I dropped a small detail near the bottom. I run my local model with an alias.
Terminal window1
llamaServer
I described it as “a small script. It lets me pick a GGUF model, context size, and reasoning mode. It remembers the last choice, so most of the time I just start it and get going.”
That script grew up. Today I’m releasing LlamaStash , the first public release of a fast, cross-platform, terminal-native launcher for llama.cpp with zero overhead.
It is a TUI. It is also a CLI. It is also a background daemon. It is also an OpenAI-compatible proxy. One small Rust binary (~5 MB download), three personas, same primitives everywhere.
Why does this exist?
Well: ADHD, insomnia and the below 😆
Local LLMs sit in an awkward gap.
On one side, raw llama-server is fast and honest. It is also tedious. You memorize flags. You write wrapper scripts. You remember which port a model is on. You guess context sizes that may or may not fit your VRAM. After a while you have a ~/scripts/ folder full of shell aliases that nobody else can read.
On the other side, Ollama and LM Studio wrap llama.cpp in friendlier shells. Ollama is opinionated about model storage, format, and config. LM Studio is GUI-first and not terminal native. Both pay a real performance cost compared to raw llama-server, and both hide the underlying primitives that I actually like working with.
I wanted something in the middle. A launcher that:
Stays out of llama.cpp’s way (no fork, no patched copy, no opinions about its flags).
Is fast to invoke from a terminal and fast to drive from a script.
Is also good as a TUI, because I genuinely like terminal interfaces.
Treats agents and humans as equals. Anything a person can do in the TUI, an agent can do via --json.
Has a daemon underneath so models survive the TUI closing, and so multiple clients can hit the same model concurrently.
Exposes an OpenAI/Ollama-compatible proxy on loopback so any existing OpenAI client (your editor, your agent, your scripts) just works without per-model setup.
LlamaStash is that.
Why a TUI?
I love terminal UIs (see KDash, JWT-UI, and battleship-rs). I wrote KDash, a Kubernetes dashboard TUI in Rust. That was 2020. The Rust TUI ecosystem at the time was tui-rs and a lot of patience. Threading was DIY. Layouts were arithmetic. State management was you-figure-it-out.
Building LlamaStash brought me back to a lot of that, but the ground has shifted. ratatui (the maintained fork of tui-rs) is a real, polished framework now. tokio makes async daemons boring in a good way. hyper gives you a respectable HTTP server in a few hundred lines. crossterm handles the cross-platform terminal mess. sysinfo covers host metrics. The pieces are all there and you have LLMs to help you speed up everything to 10x.
I still believe what I wrote then. Rust gives you safety, speed, and a great UX without picking just one. LlamaStash is ~180 Rust files and not one production panic. It feels solid in a way that the JavaScript and Java tooling I shipped earlier in my career never quite did.
OK, enough nostalgia. Let me show you what the tool actually does.
Zero to chat in one command
Terminal window1
llamastash init
This is the first-run wizard. It detects your hardware, installs llama-server for your OS/GPU combo, looks at your available VRAM, recommends a GGUF model that fits, downloads it, writes a tuned config, updates configs for your AI tools (OpenCode, Zed, etc.), and smoke-launches it to make sure the whole pipeline works end-to-end.
On my Strix Halo machine that means an automatic ROCm/HIP path with sensible defaults. On a MacBook it picks up Metal. On an NVIDIA Linux box it picks up Vulkan (CUDA coming soon). On a Windows 11 machine it picks the matching win-cpu / win-cuda / win-vulkan llama.cpp asset. On an old laptop with no GPU it picks up CPU and quietly recommends a smaller model.
If you already have a llama.cpp build you like, point at it with --llama-server. If you already have GGUFs in ~/.cache/huggingface/, ~/.ollama/models, or ~/.lmstudio/models, LlamaStash discovers them. It also watches your model paths live, so a new download shows up without a restart.
Already have a coding model on disk? Skip the wizard.
Terminal window1
llamastash start qwen-coder --ctx 16384 --reasoning on
That’s the whole command. Or use the TUI and pick from a list.
The...