LLMKube – A Kubernetes operator for local LLMs across Nvidia and Mac fleets

richteach1 pts0 comments

LLMKube - Kubernetes for Local LLMs

Skip to content

v0.7.9 Open Source · Kubernetes Native · NVIDIA + Apple Silicon · vLLM + llama.cpp + mlx-server<br>Run production LLMs<br>on your own hardware<br>A Kubernetes operator for self-hosted LLM inference. vLLM, llama.cpp, TGI, NVIDIA, Apple Silicon.<br>Recently a local model on two $400 GPUs wrote its own next feature, merged as PR #283.<br>Get Started View on GitHub

Star Join the Discord • What's new in 0.7.9: a new mlx-server runtime for Apple Silicon, plus kubectl scale support →

See it in action<br>Deploy LLMs with any runtime in seconds using the llmkube CLI

terminal

$ llmkube deploy llama-3.1-8b --gpu --runtime vllm<br>🚀 Deploying LLM inference service<br>═══════════════════════════════════════════════<br>Name: llama-3.1-8b<br>Runtime: vllm<br>Accelerator: cuda<br>GPU: 2 x nvidia

📦 Creating Model 'llama-3.1-8b'...<br>✅ Model created

⚙️ Creating InferenceService 'llama-3.1-8b'...<br>✅ InferenceService created (runtime: vllm)

Step 1/4: Deploy with vLLM runtime

What's happening here<br>Recent posts from the lab

Releases 8 min read<br>What we shipped in LLMKube 0.7.9: a new mlx-server runtime for Apple Silicon, four bugs the autoscaling tutorial flushed out, and kubectl scale support<br>0.7.9 adds mlx-server as a first-class runtime on the metal-agent: an OpenAI-compatible MLX inference server you select with --runtime…<br>Read →Releases 8 min read<br>What we shipped in LLMKube 0.7.8: ModelRouter Phase 1, fail-closed PII routing, and a hybrid local + cloud agentic story<br>0.7.8 lands ModelRouter Phase 1: a single OpenAI-compatible endpoint that dispatches across local InferenceServices and external providers…<br>Read →Releases 7 min read<br>What we shipped in LLMKube 0.7.7: OpenShift first-class, vllm-swift + TurboQuant, and a community-shipped Longhorn fix<br>0.7.7 makes OpenShift a first-class deploy target, lands the vllm-swift runtime with TurboQuant KV cache passthrough on Apple Silicon…<br>Read →<br>All posts →

Why LLMKube?<br>Local LLMs are great for prototyping. Scaling them for a team is where it gets hard.

The scaling problem<br>× Silent failures with no alerts<br>× Multi-GPU memory math by trial and error<br>× Updates that break your setup<br>× Docker Compose that doesn't scale<br>× One person managing everything<br>× Every machine set up by hand

With LLMKube<br>Pluggable runtimes: vLLM, TGI, llama.cpp, or bring your own<br>HPA autoscaling that responds to real inference metrics<br>GPU layer offloading with custom sharding splits<br>Infrastructure as code, not scripts and duct tape<br>Grafana dashboards for inference metrics out of the box<br>CUDA 13 and NVIDIA Blackwell GPU support

vLLM for speed. TGI for flexibility. llama.cpp for efficiency. LLMKube for all of them.<br>One operator, every runtime. The platform layer your inference stack is missing.

Deploy an LLM in seconds<br>Simple, declarative YAML that feels native to Kubernetes developers

1. Define Model<br>Specify the LLM to use<br>2. llama.cpp Service<br>Deploy with autoscaling<br>3. vLLM Service<br>Swap runtime in one line<br>4. Use API<br>Query your LLM

Copy<br>apiVersion: inference.llmkube.dev/v1alpha1<br>kind: Model<br>metadata:<br>name: phi-3-mini<br>spec:<br>source: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf<br>format: gguf<br>quantization: Q4_K_M<br>hardware:<br>accelerator: cuda<br>gpu:<br>enabled: true<br>count: 1<br>resources:<br>cpu: "2"<br>memory: "4Gi"<br>Supports GGUF models from HuggingFace, with automatic download and caching<br>Next

Limited to 10 Teams Early Adopter Program<br>Help shape the future of LLMKube and get direct access to the maintainer.

What you get<br>Private Discord with other early adopters<br>Direct input on the roadmap<br>Your logo on our website (when ready)<br>Early access to new features

What we need<br>Real-world feedback on your use case<br>30 minutes monthly for a feedback call<br>Permission to share your story (anonymized if needed)

Apply to join<br>Don't fill this out if you're human:<br>Name *<br>Work Email *

Company *<br>Role

What are you trying to build? *<br>Apply Now

Ready to deploy your first LLM?<br>Join the community of developers deploying LLMs on Kubernetes.<br>Get Started on GitHub Read the Docs<br>Open source and free forever

llmkube runtime vllm llama inference deploy

Related Articles