Modelplane · The open source control plane for AI inference<br>In early development · building in the open<br>The open source<br>control plane for<br>AI inference<br>Install Modelplane in your own environment, and it operates your GPU clusters across cloud, neocloud, and on-premise as one inference fleet: provisioning clusters, placing models, autoscaling replicas, caching weights, and routing through a single OpenAI-compatible endpoint. It runs any model on any serving engine on any infrastructure, all under your control.<br>Get started →View on GitHub
ModelDeployment<br>POSTSGLang<br>deepseek-r1<br>prefill / decode8× B200
ModelDeployment<br>POSTvLLM<br>llama-4-70b<br>tensor parallel4× H100
ModelDeployment<br>POSTTRT-LLM<br>qwen3-235b<br>data / expert8× H200
reconciling
provisioningschedulingautoscalingroutingcaching<br>policygovernancecompliance
InferenceCluster<br>GCP<br>Cloud · us-central1<br>256× TPU v6e8× H100
InferenceCluster<br>CoreWeave<br>Neocloud · gpu-east<br>72× GB2008× H200
InferenceCluster<br>DGX<br>On-prem · dc-1<br>32× H1008× A100
Created by
Built on
The inference ecosystem. Under one control plane.<br>Modelplane doesn’t replace the inference ecosystem, it orchestrates it across three layers: the models you run, the engines that serve them, and the infrastructure underneath, across accelerators and providers. It composes what your teams already choose and integrates new pieces as they emerge.
composesprovisionsschedulesautoscalesroutescaches
orchestrates<br>Models<br>open weights & custom
LlamaQwenDeepSeekMistralgpt-ossGemma+ any open-weight model
Serving<br>inference engines
vLLMSSGLangTensorRT-LLMTGIlllama.cppLLMDeploy+ any engine
Infrastructure<br>providers & accelerators
Providers<br>AWSGCPAzureCoreWeaveLambdaoon-prem+ any Kubernetes
Accelerators<br>NVIDIAAMDGoogle TPUAWS TrainiumIntel Gaudi+ any accelerator
Advanced serving. From single GPU to frontier.<br>Modelplane matches each model’s requirements and serving topologies to the hardware available, using expressive CEL selectors and composable API shapes. Topology is declared as shape, so it places anything from a single GPU to multi-node, disaggregated frontier serving, and new parallelism strategies as they emerge.<br>tensor parallel<br>Split each layer across GPUs in a node for low-latency single-model serving.
→→pipeline parallel<br>Stage a model across nodes so very large models fit beyond a single box.
data / expert<br>Replicate workers, or shard experts across them for MoE throughput.
→prefill / decode<br>Disaggregate prefill and decode onto separate pools for frontier serving.
+ emerging topology<br>Described as shape, so future parallelism strategies just work.
A resource API for inference. Serving two roles.<br>Modelplane defines a flexible API for inference. Each role owns its own resources: developers declare model deployments and expose one service across regions, clouds, and managed vendors, while platform teams declare the fleet of clusters, accelerators, and gateways underneath.<br>Development & ML teams<br>Define model deployments: the model, the engine and its configuration, serving topology, hardware request, region, and environment. Then expose them as one service, weighted across regions, clouds, and managed vendors.
kind: ModelService<br>name: prod-llama<br>routing: weighted, openai
60kind: ModelDeployment<br>model: llama-4-70b<br>cluster: aws-us-east
30kind: ModelDeployment<br>model: llama-4-70b<br>cluster: gcp-eu-west
10kind: ModelEndpoint<br>target: vendor-api<br>type: managed
Platform teams<br>Declare the fleet: a gateway over clusters across clouds and regions, each with its own hardware classes and node pools. Set the capacity, accelerators, policy, and cost controls the whole fleet runs within.
kind: InferenceGateway<br>name: prod-gateway<br>routes: all endpoints
kind: InferenceCluster<br>name: aws-us-east<br>pools: h200, h100
kind: InferenceCluster<br>name: gcp-eu-west<br>pools: tpu-v6e, a100
kind: InferenceCluster<br>name: onprem-dc1<br>pools: h100, l40s
Capabilities built for the fleet. Not just the cluster.<br>01 / Provisioning<br>Provision the fleet, or bring your own<br>Provision inference clusters on AWS, GCP, and Azure, or bring your own on any Kubernetes. Each gets hardware classes, node pools, and the full serving stack, installed and continuously reconciled, all fronted by one central inference gateway.
Provisioning<br>Provision · GKE / EKSBring your own · any K8s
Modelplane installs & reconciles
InferenceCluster● reconciled<br>classes: h200-8x, h100-8x · node pools<br>✓ GPU operator & drivers<br>✓ Serving engines
02 / Scheduling<br>One global pool of capacity<br>Modelplane treats every cluster, cloud, and region as one global pool. A fleet scheduler places each model's replicas where its requirements match a cluster's capabilities, then hands off to the cluster's own scheduler and DRA.
Two-level scheduling<br>fleet scheduler<br>one global pool<br>tracks requirements<br>↔ capabilities
→places replicas<br>aws-us-east<br>gcp-eu-west<br>azure-us2
→cluster scheduler<br>DRA<br>bound
03 / Autoscaling<br>Scale replicas across clouds and regions<br>Every model exposes the standard Kubernetes...