LLMKube – A Kubernetes operator for local LLMs across Nvidia and Mac fleets

LLMKube - Kubernetes for Local LLMs

v0.7.9 Open Source · Kubernetes Native · NVIDIA + Apple Silicon · vLLM + llama.cpp + mlx-server Run production LLMs on your own hardware A Kubernetes operator for self-hosted LLM inference. vLLM, llama.cpp, TGI, NVIDIA, Apple Silicon. Recently a local model on two $400 GPUs wrote its own next feature, merged as PR #283. Get Started View on GitHub

Star Join the Discord • What's new in 0.7.9: a new mlx-server runtime for Apple Silicon, plus kubectl scale support →

See it in action Deploy LLMs with any runtime in seconds using the llmkube CLI

terminal

$ llmkube deploy llama-3.1-8b --gpu --runtime vllm 🚀 Deploying LLM inference service ═══════════════════════════════════════════════ Name: llama-3.1-8b Runtime: vllm Accelerator: cuda GPU: 2 x nvidia

📦 Creating Model 'llama-3.1-8b'... ✅ Model created

⚙️ Creating InferenceService 'llama-3.1-8b'... ✅ InferenceService created (runtime: vllm)

Step 1/4: Deploy with vLLM runtime

What's happening here Recent posts from the lab

Releases 8 min read What we shipped in LLMKube 0.7.9: a new mlx-server runtime for Apple Silicon, four bugs the autoscaling tutorial flushed out, and kubectl scale support 0.7.9 adds mlx-server as a first-class runtime on the metal-agent: an OpenAI-compatible MLX inference server you select with --runtime… Read →Releases 8 min read What we shipped in LLMKube 0.7.8: ModelRouter Phase 1, fail-closed PII routing, and a hybrid local + cloud agentic story 0.7.8 lands ModelRouter Phase 1: a single OpenAI-compatible endpoint that dispatches across local InferenceServices and external providers… Read →Releases 7 min read What we shipped in LLMKube 0.7.7: OpenShift first-class, vllm-swift + TurboQuant, and a community-shipped Longhorn fix 0.7.7 makes OpenShift a first-class deploy target, lands the vllm-swift runtime with TurboQuant KV cache passthrough on Apple Silicon… Read → All posts →

Why LLMKube? Local LLMs are great for prototyping. Scaling them for a team is where it gets hard.

The scaling problem × Silent failures with no alerts × Multi-GPU memory math by trial and error × Updates that break your setup × Docker Compose that doesn't scale × One person managing everything × Every machine set up by hand

With LLMKube Pluggable runtimes: vLLM, TGI, llama.cpp, or bring your own HPA autoscaling that responds to real inference metrics GPU layer offloading with custom sharding splits Infrastructure as code, not scripts and duct tape Grafana dashboards for inference metrics out of the box CUDA 13 and NVIDIA Blackwell GPU support

vLLM for speed. TGI for flexibility. llama.cpp for efficiency. LLMKube for all of them. One operator, every runtime. The platform layer your inference stack is missing.

Deploy an LLM in seconds Simple, declarative YAML that feels native to Kubernetes developers

1. Define Model Specify the LLM to use 2. llama.cpp Service Deploy with autoscaling 3. vLLM Service Swap runtime in one line 4. Use API Query your LLM

Copy apiVersion: inference.llmkube.dev/v1alpha1 kind: Model metadata: name: phi-3-mini spec: source: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf format: gguf quantization: Q4_K_M hardware: accelerator: cuda gpu: enabled: true count: 1 resources: cpu: "2" memory: "4Gi" Supports GGUF models from HuggingFace, with automatic download and caching Next

Limited to 10 Teams Early Adopter Program Help shape the future of LLMKube and get direct access to the maintainer.

What you get Private Discord with other early adopters Direct input on the roadmap Your logo on our website (when ready) Early access to new features

What we need Real-world feedback on your use case 30 minutes monthly for a feedback call Permission to share your story (anonymized if needed)

Apply to join Don't fill this out if you're human: Name * Work Email *

Company * Role

What are you trying to build? * Apply Now

Ready to deploy your first LLM? Join the community of developers deploying LLMs on Kubernetes. Get Started on GitHub Read the Docs Open source and free forever

LLMKube – A Kubernetes operator for local LLMs across Nvidia and Mac fleets

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

SpaceX not the behemoth everyone thought

Elevated error rates on requests to multiple models

Donald Trump and sons to be 'forever' exempt from tax audits

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self- Play