Modelplane

hasheddan1 pts0 comments

Modelplane · The open source control plane for AI inference<br>In early development · building in the open<br>The open source<br>control plane for<br>AI inference<br>Install Modelplane in your own environment, and it operates your GPU clusters across cloud, neocloud, and on-premise as one inference fleet: provisioning clusters, placing models, autoscaling replicas, caching weights, and routing through a single OpenAI-compatible endpoint. It runs any model on any serving engine on any infrastructure, all under your control.<br>Get started →View on GitHub

ModelDeployment<br>POSTSGLang<br>deepseek-r1<br>prefill / decode8× B200

ModelDeployment<br>POSTvLLM<br>llama-4-70b<br>tensor parallel4× H100

ModelDeployment<br>POSTTRT-LLM<br>qwen3-235b<br>data / expert8× H200

reconciling

provisioningschedulingautoscalingroutingcaching<br>policygovernancecompliance

InferenceCluster<br>GCP<br>Cloud · us-central1<br>256× TPU v6e8× H100

InferenceCluster<br>CoreWeave<br>Neocloud · gpu-east<br>72× GB2008× H200

InferenceCluster<br>DGX<br>On-prem · dc-1<br>32× H1008× A100

Created by

Built on

The inference ecosystem. Under one control plane.<br>Modelplane doesn’t replace the inference ecosystem, it orchestrates it across three layers: the models you run, the engines that serve them, and the infrastructure underneath, across accelerators and providers. It composes what your teams already choose and integrates new pieces as they emerge.

composesprovisionsschedulesautoscalesroutescaches

orchestrates<br>Models<br>open weights & custom

LlamaQwenDeepSeekMistralgpt-ossGemma+ any open-weight model

Serving<br>inference engines

vLLMSSGLangTensorRT-LLMTGIlllama.cppLLMDeploy+ any engine

Infrastructure<br>providers & accelerators

Providers<br>AWSGCPAzureCoreWeaveLambdaoon-prem+ any Kubernetes

Accelerators<br>NVIDIAAMDGoogle TPUAWS TrainiumIntel Gaudi+ any accelerator

Advanced serving. From single GPU to frontier.<br>Modelplane matches each model’s requirements and serving topologies to the hardware available, using expressive CEL selectors and composable API shapes. Topology is declared as shape, so it places anything from a single GPU to multi-node, disaggregated frontier serving, and new parallelism strategies as they emerge.<br>tensor parallel<br>Split each layer across GPUs in a node for low-latency single-model serving.

→→pipeline parallel<br>Stage a model across nodes so very large models fit beyond a single box.

data / expert<br>Replicate workers, or shard experts across them for MoE throughput.

→prefill / decode<br>Disaggregate prefill and decode onto separate pools for frontier serving.

+ emerging topology<br>Described as shape, so future parallelism strategies just work.

A resource API for inference. Serving two roles.<br>Modelplane defines a flexible API for inference. Each role owns its own resources: developers declare model deployments and expose one service across regions, clouds, and managed vendors, while platform teams declare the fleet of clusters, accelerators, and gateways underneath.<br>Development & ML teams<br>Define model deployments: the model, the engine and its configuration, serving topology, hardware request, region, and environment. Then expose them as one service, weighted across regions, clouds, and managed vendors.

kind: ModelService<br>name: prod-llama<br>routing: weighted, openai

60kind: ModelDeployment<br>model: llama-4-70b<br>cluster: aws-us-east

30kind: ModelDeployment<br>model: llama-4-70b<br>cluster: gcp-eu-west

10kind: ModelEndpoint<br>target: vendor-api<br>type: managed

Platform teams<br>Declare the fleet: a gateway over clusters across clouds and regions, each with its own hardware classes and node pools. Set the capacity, accelerators, policy, and cost controls the whole fleet runs within.

kind: InferenceGateway<br>name: prod-gateway<br>routes: all endpoints

kind: InferenceCluster<br>name: aws-us-east<br>pools: h200, h100

kind: InferenceCluster<br>name: gcp-eu-west<br>pools: tpu-v6e, a100

kind: InferenceCluster<br>name: onprem-dc1<br>pools: h100, l40s

Capabilities built for the fleet. Not just the cluster.<br>01 / Provisioning<br>Provision the fleet, or bring your own<br>Provision inference clusters on AWS, GCP, and Azure, or bring your own on any Kubernetes. Each gets hardware classes, node pools, and the full serving stack, installed and continuously reconciled, all fronted by one central inference gateway.

Provisioning<br>Provision · GKE / EKSBring your own · any K8s

Modelplane installs & reconciles

InferenceCluster● reconciled<br>classes: h200-8x, h100-8x · node pools<br>✓ GPU operator & drivers<br>✓ Serving engines

02 / Scheduling<br>One global pool of capacity<br>Modelplane treats every cluster, cloud, and region as one global pool. A fleet scheduler places each model's replicas where its requirements match a cluster's capabilities, then hands off to the cluster's own scheduler and DRA.

Two-level scheduling<br>fleet scheduler<br>one global pool<br>tracks requirements<br>↔ capabilities

→places replicas<br>aws-us-east<br>gcp-eu-west<br>azure-us2

→cluster scheduler<br>DRA<br>bound

03 / Autoscaling<br>Scale replicas across clouds and regions<br>Every model exposes the standard Kubernetes...

model serving inference across modelplane fleet

Related Articles