LLMKube - Kubernetes for Local LLMs
Skip to content
v0.7.9 Open Source · Kubernetes Native · NVIDIA + Apple Silicon · vLLM + llama.cpp + mlx-server<br>Run production LLMs<br>on your own hardware<br>A Kubernetes operator for self-hosted LLM inference. vLLM, llama.cpp, TGI, NVIDIA, Apple Silicon.<br>Recently a local model on two $400 GPUs wrote its own next feature, merged as PR #283.<br>Get Started View on GitHub
Star Join the Discord • What's new in 0.7.9: a new mlx-server runtime for Apple Silicon, plus kubectl scale support →
See it in action<br>Deploy LLMs with any runtime in seconds using the llmkube CLI
terminal
$ llmkube deploy llama-3.1-8b --gpu --runtime vllm<br>🚀 Deploying LLM inference service<br>═══════════════════════════════════════════════<br>Name: llama-3.1-8b<br>Runtime: vllm<br>Accelerator: cuda<br>GPU: 2 x nvidia
📦 Creating Model 'llama-3.1-8b'...<br>✅ Model created
⚙️ Creating InferenceService 'llama-3.1-8b'...<br>✅ InferenceService created (runtime: vllm)
Step 1/4: Deploy with vLLM runtime
What's happening here<br>Recent posts from the lab
Releases 8 min read<br>What we shipped in LLMKube 0.7.9: a new mlx-server runtime for Apple Silicon, four bugs the autoscaling tutorial flushed out, and kubectl scale support<br>0.7.9 adds mlx-server as a first-class runtime on the metal-agent: an OpenAI-compatible MLX inference server you select with --runtime…<br>Read →Releases 8 min read<br>What we shipped in LLMKube 0.7.8: ModelRouter Phase 1, fail-closed PII routing, and a hybrid local + cloud agentic story<br>0.7.8 lands ModelRouter Phase 1: a single OpenAI-compatible endpoint that dispatches across local InferenceServices and external providers…<br>Read →Releases 7 min read<br>What we shipped in LLMKube 0.7.7: OpenShift first-class, vllm-swift + TurboQuant, and a community-shipped Longhorn fix<br>0.7.7 makes OpenShift a first-class deploy target, lands the vllm-swift runtime with TurboQuant KV cache passthrough on Apple Silicon…<br>Read →<br>All posts →
Why LLMKube?<br>Local LLMs are great for prototyping. Scaling them for a team is where it gets hard.
The scaling problem<br>× Silent failures with no alerts<br>× Multi-GPU memory math by trial and error<br>× Updates that break your setup<br>× Docker Compose that doesn't scale<br>× One person managing everything<br>× Every machine set up by hand
With LLMKube<br>Pluggable runtimes: vLLM, TGI, llama.cpp, or bring your own<br>HPA autoscaling that responds to real inference metrics<br>GPU layer offloading with custom sharding splits<br>Infrastructure as code, not scripts and duct tape<br>Grafana dashboards for inference metrics out of the box<br>CUDA 13 and NVIDIA Blackwell GPU support
vLLM for speed. TGI for flexibility. llama.cpp for efficiency. LLMKube for all of them.<br>One operator, every runtime. The platform layer your inference stack is missing.
Deploy an LLM in seconds<br>Simple, declarative YAML that feels native to Kubernetes developers
1. Define Model<br>Specify the LLM to use<br>2. llama.cpp Service<br>Deploy with autoscaling<br>3. vLLM Service<br>Swap runtime in one line<br>4. Use API<br>Query your LLM
Copy<br>apiVersion: inference.llmkube.dev/v1alpha1<br>kind: Model<br>metadata:<br>name: phi-3-mini<br>spec:<br>source: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf<br>format: gguf<br>quantization: Q4_K_M<br>hardware:<br>accelerator: cuda<br>gpu:<br>enabled: true<br>count: 1<br>resources:<br>cpu: "2"<br>memory: "4Gi"<br>Supports GGUF models from HuggingFace, with automatic download and caching<br>Next
Limited to 10 Teams Early Adopter Program<br>Help shape the future of LLMKube and get direct access to the maintainer.
What you get<br>Private Discord with other early adopters<br>Direct input on the roadmap<br>Your logo on our website (when ready)<br>Early access to new features
What we need<br>Real-world feedback on your use case<br>30 minutes monthly for a feedback call<br>Permission to share your story (anonymized if needed)
Apply to join<br>Don't fill this out if you're human:<br>Name *<br>Work Email *
Company *<br>Role
What are you trying to build? *<br>Apply Now
Ready to deploy your first LLM?<br>Join the community of developers deploying LLMs on Kubernetes.<br>Get Started on GitHub Read the Docs<br>Open source and free forever