Modelplane

Modelplane · The open source control plane for AI inference

In early development · building in the open

Install Modelplane in your own environment, and it operates your GPU clusters across cloud, neocloud, and on-premise as one inference fleet: provisioning clusters, placing models, autoscaling replicas, caching weights, and routing through a single OpenAI-compatible endpoint. It runs any model on any serving engine on any infrastructure, all under your control.

ModelDeployment (POST) · SGLang · deepseek-r1 · prefill / decode · 8× B200

ModelDeployment (POST) · vLLM · llama-4-70b · tensor parallel · 4× H100

ModelDeployment (POST) · TRT-LLM · qwen3-235b · data / expert · 8× H200

reconciling

provisioning · scheduling · autoscaling · routing · caching

policy · governance · compliance

InferenceCluster · GCP (Cloud · us-central1) · 256× TPU v6e, 8× H100

InferenceCluster · CoreWeave (Neocloud · gpu-east) · 72× GB200, 8× H200

InferenceCluster · DGX (On-prem · dc-1) · 32× H100, 8× A100


The inference ecosystem. Under one control plane.

Modelplane doesn’t replace the inference ecosystem; it orchestrates it across three layers: the models you run, the engines that serve them, and the infrastructure underneath, across accelerators and providers. It composes what your teams already choose and integrates new pieces as they emerge.

composes · provisions · schedules · autoscales · routes · caches

Modelplane orchestrates:

Models (open weights & custom): Llama, Qwen, DeepSeek, Mistral, gpt-oss, Gemma, + any open-weight model

Serving (inference engines): vLLM, SGLang, TensorRT-LLM, TGI, llama.cpp, LMDeploy, + any engine

Infrastructure (providers & accelerators):

Providers: AWS, GCP, Azure, CoreWeave, Lambda, on-prem, + any Kubernetes

Accelerators: NVIDIA, AMD, Google TPU, AWS Trainium, Intel Gaudi, + any accelerator

Advanced serving. From single GPU to frontier.

Modelplane matches each model’s requirements and serving topology to the hardware available, using expressive CEL selectors and composable API shapes. Topology is declared as shape, so it can place anything from a single GPU to multi-node, disaggregated frontier serving, and can accommodate new parallelism strategies as they emerge.

tensor parallel: Split each layer across GPUs in a node for low-latency single-model serving.

pipeline parallel: Stage a model across nodes so very large models fit beyond a single box.

data / expert: Replicate workers, or shard experts across them for MoE throughput.

prefill / decode: Disaggregate prefill and decode onto separate pools for frontier serving.

+ emerging topology: Described as shape, so future parallelism strategies just work.
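To make "topology as shape" concrete, here is a minimal Python sketch of how parallelism degrees translate into a GPU count. It is an illustration only: the `Shape` class, its field names, and `total_gpus` are hypothetical and are not Modelplane's API. It relies on the standard rule that a replica group needs tensor × pipeline × data GPUs, and treats a disaggregated deployment as the sum of its pools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Hypothetical serving shape: one degree per parallelism axis."""
    tensor: int = 1    # GPUs that split each layer (within a node)
    pipeline: int = 1  # stages the model is split into (across nodes)
    data: int = 1      # replicated workers

    def gpus(self) -> int:
        # Every axis multiplies the number of GPUs needed.
        return self.tensor * self.pipeline * self.data


def total_gpus(pools: dict[str, Shape]) -> int:
    """GPUs for a disaggregated deployment: sum over its named pools."""
    return sum(shape.gpus() for shape in pools.values())


if __name__ == "__main__":
    print(Shape(tensor=4).gpus())                # tensor parallel on 4 GPUs
    print(Shape(tensor=8, pipeline=2).gpus())    # two 8-GPU pipeline stages
    print(total_gpus({"prefill": Shape(tensor=4), "decode": Shape(tensor=4)}))
```

Because the count is derived from the shape rather than hard-coded per strategy, a new axis could be added as another field without changing how placement consumes the result.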

A resource API for inference. Serving two roles.

Modelplane defines a flexible API for inference. Each role owns its own resources: developers declare model deployments and expose one service across regions, clouds, and managed vendors, while platform teams declare the fleet of clusters, accelerators, and gateways underneath.

Development & ML teams

Define model deployments: the model, the engine and its configuration, serving topology, hardware request, region, and environment. Then expose them as one service, weighted across regions, clouds, and managed vendors.

kind: ModelService · name: prod-llama · routing: weighted, openai

weight 60 · kind: ModelDeployment · model: llama-4-70b · cluster: aws-us-east

weight 30 · kind: ModelDeployment · model: llama-4-70b · cluster: gcp-eu-west

weight 10 · kind: ModelEndpoint · target: vendor-api · type: managed
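The 60 / 30 / 10 split above reads as weighted routing across three backends. The following Python sketch shows one common way such a split can be implemented, by sampling a backend with probability proportional to its weight. It is an assumption about the behavior, not Modelplane's router: `Backend`, `BACKENDS`, and `pick_backend` are hypothetical names, and only the backend names and weights come from the page.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    weight: int


# Names and weights taken from the ModelService card; the structure is assumed.
BACKENDS = [
    Backend("aws-us-east", 60),
    Backend("gcp-eu-west", 30),
    Backend("vendor-api", 10),
]


def pick_backend(backends: list[Backend], rng: random.Random) -> Backend:
    """Return one backend, chosen with probability weight / total weight."""
    total = sum(b.weight for b in backends)
    point = rng.random() * total  # uniform in [0, total)
    cumulative = 0
    for backend in backends:
        cumulative += backend.weight
        if point < cumulative:
            return backend
    return backends[-1]  # unreachable for positive weights; kept as a guard


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [pick_backend(BACKENDS, rng).name for _ in range(10_000)]
    for b in BACKENDS:
        print(b.name, picks.count(b.name) / len(picks))
```

Over many requests the observed shares approach the configured weights; a production router would typically also account for backend health and capacity, which this sketch ignores.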

Platform teams

Declare the fleet: a gateway over clusters across clouds and regions, each with its own hardware classes and node pools. Set the capacity, accelerators, policy, and cost controls the whole fleet runs within.

kind: InferenceGateway · name: prod-gateway · routes: all endpoints

kind: InferenceCluster · name: aws-us-east · pools: h200, h100

kind: InferenceCluster · name: gcp-eu-west · pools: tpu-v6e, a100

kind: InferenceCluster · name: onprem-dc1 · pools: h100, l40s

Capabilities built for the fleet. Not just the cluster.

01 / Provisioning

Provision the fleet, or bring your own

Provision inference clusters on AWS, GCP, and Azure, or bring your own on any Kubernetes. Each gets hardware classes, node pools, and the full serving stack, installed and continuously reconciled, all fronted by one central inference gateway.

Provisioning: Provision · GKE / EKS | Bring your own · any K8s

Modelplane installs & reconciles

InferenceCluster ● reconciled · classes: h200-8x, h100-8x · node pools · ✓ GPU operator & drivers · ✓ Serving engines
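"Installed and continuously reconciled" usually means comparing a declared state with what is actually running and acting on the difference. The Python sketch below illustrates one pass of that idea under stated assumptions: the `reconcile` function, the component names, and the version strings are all hypothetical and do not describe Modelplane's controllers.

```python
def reconcile(desired: dict[str, str], observed: dict[str, str]) -> list[str]:
    """Return the actions that would move `observed` toward `desired`.

    Both arguments map a component name to a version string.
    """
    actions = []
    for name, version in sorted(desired.items()):
        if name not in observed:
            actions.append(f"install {name}@{version}")
        elif observed[name] != version:
            actions.append(f"upgrade {name} {observed[name]} -> {version}")
    # Anything running that is no longer declared gets removed.
    for name in sorted(observed.keys() - desired.keys()):
        actions.append(f"remove {name}")
    return actions


if __name__ == "__main__":
    # Hypothetical component names and versions, for illustration only.
    desired = {"gpu-operator": "v2", "vllm": "v1"}
    observed = {"gpu-operator": "v1", "tgi": "v1"}
    for action in reconcile(desired, observed):
        print(action)
```

Running the same function again once the actions have been applied returns an empty list, which is the property that makes a reconcile loop safe to repeat.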

02 / Scheduling

One global pool of capacity

Modelplane treats every cluster, cloud, and region as one global pool. A fleet scheduler places each model's replicas where its requirements match a cluster's capabilities, then hands off to the cluster's own scheduler and DRA.

Two-level scheduling: fleet scheduler (one global pool; tracks requirements ↔ capabilities) → places replicas (aws-us-east, gcp-eu-west, azure-us2) → cluster scheduler (DRA bound)
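The fleet-level half of two-level scheduling can be pictured as filtering pools by capability and then assigning replicas. The Python sketch below is a simplified stand-in, not Modelplane's scheduler: it uses plain predicates where the page mentions CEL selectors, assumes one replica per node, and uses a greedy first-fit order. `Pool`, `Requirement`, `place`, and all free-node counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    cluster: str
    accelerator: str
    gpus_per_node: int
    free_nodes: int


@dataclass
class Requirement:
    accelerators: set[str]   # acceptable accelerator types
    gpus_per_replica: int    # must fit on one node in this sketch
    replicas: int


def place(req: Requirement, pools: list[Pool]) -> dict[str, int]:
    """Greedy first-fit: map cluster name -> replicas placed there."""
    placement: dict[str, int] = {}
    remaining = req.replicas
    for pool in pools:
        if remaining == 0:
            break
        if pool.accelerator not in req.accelerators:
            continue  # capability mismatch
        if pool.gpus_per_node < req.gpus_per_replica:
            continue  # replica would not fit on a node
        take = min(remaining, pool.free_nodes)
        if take:
            placement[pool.cluster] = placement.get(pool.cluster, 0) + take
            remaining -= take
    if remaining:
        raise RuntimeError(f"{remaining} replica(s) could not be placed")
    return placement


if __name__ == "__main__":
    pools = [
        Pool("aws-us-east", "h200", 8, 2),
        Pool("aws-us-east", "h100", 8, 2),
        Pool("gcp-eu-west", "a100", 8, 4),
        Pool("onprem-dc1", "h100", 8, 1),
    ]
    print(place(Requirement({"h100"}, 8, 3), pools))
```

The result only names clusters and replica counts; binding replicas to specific nodes and devices is left to the second level, the cluster's own scheduler.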

03 / Autoscaling

Scale replicas across clouds and regions

Every model exposes the standard Kubernetes...
