SkyPilot Endpoints: Production-Ready Inference on Every Cluster You Own


SkyPilot Endpoints: Production-Ready Inference on Every Cluster You Own | SkyPilot Blog

SkyPilot Endpoints is a next-gen LLM inference system for production-ready inference in multi-cluster environments. A single YAML deploys the full serving stack - engine, autoscaler, gateway, certificates, metrics - and runs it across any number of Kubernetes clusters under one endpoint URL.

Multi-cluster inference made simple

GPU supply is limited, and teams take capacity wherever they can get it - across clouds, regions, and on-prem.

But the Kubernetes-native LLM serving stack today (KServe, llm-d, Dynamo) is single-cluster, and operating it across the resulting fleet compounds both deployment and maintenance cost.

SkyPilot Endpoints provides the cross-cluster control plane on top. It sees registered Kubernetes clusters as one pool and handles:

Placement. On deploy, SkyPilot selects a cluster with sufficient GPU capacity for the configured replica count, accounting for preferences (region, cost, availability) declared in the YAML.
Scaling. When autoscaling adds replicas beyond the home cluster's capacity, additional replicas land on the next cluster with available GPUs.
Failure recovery. On cluster failure, replicas are recreated on healthy clusters. The endpoint URL does not change.

Clients see one endpoint URL; the infra team manages one spec across the fleet.

Deploy once. Place anywhere. Survive cluster failure.

[Interactive demo: SkyPilot autoscales replicas across clusters behind a single endpoint URL; terminating a cluster makes its replicas migrate.]
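The placement, scaling, and failure-recovery behavior described above can be pictured with a small simulation. This is only a sketch under stated assumptions, not SkyPilot's actual scheduler: the `Cluster` class and the `deploy`, `add_replicas`, and `fail_cluster` functions are hypothetical names, and the order of the cluster list stands in for the region, cost, and availability preferences declared in the YAML.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # Hypothetical model of one registered Kubernetes cluster.
    name: str
    free_gpus: int
    healthy: bool = True
    replicas: int = 0


def add_replicas(clusters, count, gpus_per_replica):
    """Place `count` replicas, spilling over to the next cluster with free GPUs.

    Clusters are tried in list order. Returns how many replicas
    could not be placed anywhere.
    """
    remaining = count
    for cluster in clusters:
        if not cluster.healthy:
            continue
        fit = min(remaining, cluster.free_gpus // gpus_per_replica)
        cluster.replicas += fit
        cluster.free_gpus -= fit * gpus_per_replica
        remaining -= fit
        if remaining == 0:
            break
    return remaining


def deploy(clusters, count, gpus_per_replica):
    """Pick the first healthy cluster that can hold every initial replica."""
    needed = count * gpus_per_replica
    for cluster in clusters:
        if cluster.healthy and cluster.free_gpus >= needed:
            cluster.free_gpus -= needed
            cluster.replicas += count
            return cluster
    return None


def fail_cluster(clusters, name, gpus_per_replica):
    """Mark a cluster unhealthy and recreate its replicas on healthy clusters."""
    failed = next(c for c in clusters if c.name == name)
    failed.healthy = False
    lost, failed.replicas = failed.replicas, 0
    return add_replicas(clusters, lost, gpus_per_replica)
```

With two clusters of 16 and 32 free GPUs and 8 GPUs per replica, `deploy(clusters, 2, 8)` fills the first cluster, a later `add_replicas(clusters, 1, 8)` lands on the second, and `fail_cluster(clusters, "a", 8)` moves the two lost replicas to the second cluster as well. The endpoint URL is not modeled here because, per the text, it does not change.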


One YAML, one dashboard

The components of the modern LLM inference stack are great in isolation: inference engines (vLLM, SGLang, TensorRT-LLM), serving frameworks (KServe, llm-d, Dynamo), autoscaling (KEDA), KV cache-aware routing (Gateway API + Inference Extension), TLS (cert-manager), metrics and tracing (Prometheus, Alloy).

Assembling them in a performant configuration is tedious per-deployment work - engine tuning, autoscaling wired to the right Prometheus query, KV cache-aware routing rules, certificate plumbing - and keeping the stack alive through engine upgrades, CRD migrations, and version-compatibility checks is a recurring tax.

SkyPilot Endpoints replaces it with a single specification that deploys and manages inference across all your clusters. Here's a minimal spec for an endpoint:

name: glm-prod
model: zai-org/GLM-5.2
resources:
  accelerators: B200:8
replicas: 2
routing: kv_cache_aware

$ sky endpoint up endpoint.yaml

Six lines to set up the whole stack from earlier - inference engine, serving framework, autoscaler, inference gateway, intelligent routing, metrics and more. SkyPilot sets up CRDs, wires inference metrics to Prometheus, installs KEDA when you turn on autoscaling, and gives you a public (or private) URL. Works on every cluster you own.

Optional fields cover production knobs:

engine: choose between vLLM, SGLang and more. Passthrough for all engine flags (max_model_len, enforce_eager, …), or override the entrypoint for custom engines.
routing: KV cache-aware routing using Gateway API Inference Extension, or P2C.
prefill: prefill/decode disaggregation (heterogeneous GPU types supported).
volumes: attach a shared model cache across replicas for faster cold starts.
autoscaling: scale on kv_cache_utilization, queue_depth or custom PromQL metrics with tunable up/down delays. Scale-to-zero supported.
Rolling updates, auth/TLS, gated-model auth and more.

The underlying stack builds on battle-tested open-source frameworks - KServe and llm-d. vLLM works out of the box; support for more inference engines is coming soon.

YAML in, dashboard out. One dashboard for the whole fleet, not one per cluster:

Overview: pod health and replica spread across clusters.
Serving metrics: latency (TTFT, TPOT, end-to-end at p50/p95/p99), throughput (output tok/s, req/s), saturation (KV-cache util, queue depth, GPU util).
Logs: per-pod engine logs, including sidecars and init containers.
Chat playground: sanity-check the deployed model from your browser.
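To make the "tunable up/down delays" of the autoscaling field more concrete, here is a minimal sketch of a target-tracking autoscaler that only acts once a scaling signal has persisted for a configured delay. It is an illustration under assumptions, not how SkyPilot or KEDA implement it: the `DelayedAutoscaler` class and all of its parameters are hypothetical, and the metric is treated as a fleet-wide total such as queue depth.

```python
import math


class DelayedAutoscaler:
    """Hypothetical autoscaler with separate scale-up and scale-down delays."""

    def __init__(self, target, min_replicas, max_replicas, up_delay_s, down_delay_s):
        self.target = target              # desired metric value per replica
        self.min_replicas = min_replicas  # 0 allows scale-to-zero
        self.max_replicas = max_replicas
        self.up_delay_s = up_delay_s
        self.down_delay_s = down_delay_s
        self._pending = None              # direction currently being waited on
        self._since = None                # time that direction was first seen

    def step(self, now, replicas, total_metric):
        """Return the replica count to run after one observation at time `now`."""
        desired = math.ceil(total_metric / self.target)
        desired = max(self.min_replicas, min(self.max_replicas, desired))
        if desired == replicas:
            # Signal went away: forget any pending scale decision.
            self._pending = None
            return replicas
        direction = "up" if desired > replicas else "down"
        if direction != self._pending:
            # New signal: start the delay timer for this direction.
            self._pending, self._since = direction, now
        delay = self.up_delay_s if direction == "up" else self.down_delay_s
        if now - self._since >= delay:
            self._pending = None
            return desired
        return replicas
```

A short scale-up delay paired with a long scale-down delay reacts quickly to load while avoiding replica churn when traffic dips briefly. Setting `min_replicas=0` lets the same loop scale to zero once the metric has stayed at zero for the full scale-down delay.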

Dashboard: app.skypilot.co/endpoints/my-endpoint
Tabs: Overview, Metrics, Logs, Playground

my-endpoint: Ready, v3, deployed 2h ago
Model: zai-org/GLM-5.2 (MoE · FP8 · 32K ctx)
Replicas: 4 / 4 (2 prefill · 2 decode)
Fleet: 32× H100 80GB (CoreWeave · Nebius)

Pods
Pod                    Role     Region              GPU      Util  Age
my-endpoint-prefill-0  Prefill  CoreWeave us-iad-1  8× H100  88%   2h
my-endpoint-decode-0   Decode   CoreWeave us-iad-1  8× H100  74%   2h
my-endpoint-prefill-1  Prefill  Nebius eu-north1    8× H100  91%   2h
my-endpoint-decode-1   Decode   Nebius eu-north1    8× H100  14%   2m

Recent events
2m   Ready      decode-1 · Pod is ready - joined decode pool
12m  Pulled     decode-1 · Image vllm/vllm-openai pulled
14m  Scheduled  decode-1 · Assigned to Nebius eu-north1 to satisfy...
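For readers unfamiliar with the latency metrics the dashboard reports, the sketch below shows one common way to derive TTFT, TPOT, and percentiles from per-request token timestamps. The function names and the timestamp layout are assumptions made for illustration; the source does not say how SkyPilot or the engine computes these values, and the nearest-rank percentile is just one of several conventions.

```python
import math


def ttft(request_start, token_times):
    """Time to first token: delay from request arrival to the first output token."""
    return token_times[0] - request_start


def tpot(token_times):
    """Time per output token: mean gap between consecutive tokens after the first."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

For a request that arrives at t=0.0 and emits tokens at 0.5, 0.75, and 1.0 seconds, `ttft` gives 0.5 s and `tpot` gives 0.25 s. Feeding the per-request values from a time window into `percentile` with p set to 50, 95, or 99 yields figures of the kind a p50/p95/p99 latency panel displays.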
