llama-dash — self-hosted inference gatewaygateway online|running 3 · peer 2|req/s 0.87<br>LOCAL INFERENCE CONTROL PLANE<br>One control plane for local inference.<br>Monitor models, requests, API keys, routing rules, and proxy metrics from one dashboard for llama-swap and compatible upstreams.<br>Read the docs →Star on GitHub<br>WORKS WITHOpenAI SDK·Claude Code·Continue·Open WebUI
OPERATOR DASHBOARD2026-04-30 · 22:01<br>REQ/S · 1M<br>0.87
P50 LATENCY<br>1.83 s
MODEL RESIDENCY · 60 MIN<br>gemma-4-26B
kokoro · peer
qwen-3.6-37B
RECENT REQUESTS<br>/v1/messages● 200950 ms<br>/v1/chat/completions● 2003.29 s<br>/v1/messages● 200644 ms
REQUEST PIPELINE<br>CLIENTS<br>OpenAI SDK<br>Claude Code<br>Continue · Open WebUI
──▶<br>llama-dash :3000<br>dashboard · auth · logs<br>routing · metrics
──▶<br>llama-swap :8080<br>llama.cpp · peers
direct /v1 upstreams<br>OpenAI · Anthropic
WHAT IT DOES<br>D01<br>Watch the box<br>Live request, token, model, upstream, and GPU status in one dashboard.
M05<br>Manage models<br>Load, unload, inspect per-model stats, and edit llama-swap config with validation.
R02<br>Track requests<br>Searchable history with filters, histograms, token counts, and cost estimates.
K08<br>Control access<br>Hashed API keys, per-key RPM/TPM limits, and model allow-lists.
P10<br>Enforce policy<br>Routing rules for model rewrites, passthrough auth, and encrypted credentials.
P06<br>Test models<br>Playgrounds for chat, image, speech, and article-to-speech transcription.