NVCF Is Now Open Source: Inside NVIDIA's GPU Function Platform | Kubesimplify
NVIDIA just open-sourced the full NVCF platform under Apache 2.0. Not a thin SDK, not a client library. The actual control plane, invocation plane, compute plane, CLIs, Helm charts, and database migrations, all in one monorepo at github.com/nvidia/nvcf.
NVCF powers the infrastructure behind services like build.nvidia.com and NVIDIA-hosted inference workflows across GPU cloud providers and DGX Cloud environments. Now you can run the whole thing yourself and read every line that makes it work.
Let’s break down how the platform actually works.
What NVCF Actually Is #
NVCF stands for NVIDIA Cloud Functions. The original managed service let you register a Docker container or Helm chart, specify a GPU type, and NVIDIA handled everything: routing, queueing, autoscaling, multi-tenant isolation. GPU cloud partners like CoreWeave ran the NVIDIA Cluster Agent on their Kubernetes clusters so their GPUs could serve functions while NVIDIA owned the control plane.
The April 2026 Apache 2.0 release publishes that control plane. The previous repos (NVIDIA/nvidia-cloud-functions, NVIDIA/nvcf-go) are now archived. This monorepo is the one place everything lives.
One honest caveat: the control plane images are currently distributed via NVIDIA's NGC registry under the nvcf-onprem org. You need NGC access to deploy the full stack today. The source code is all Apache 2.0 and inspectable, but the deployable bundle still goes through NGC while issue #12 (full OSS build) is open. I opened issue #14 asking for a community contributor path.
Three-Plane Architecture #
The entire platform is built around three independently scalable planes connected through NATS JetStream.
Control Plane runs on a dedicated Kubernetes cluster and owns function lifecycle, autoscaling decisions, and secrets management. Key services:
function-autoscaler (Rust): runs a 30-second scaling loop, reads utilization from VictoriaMetrics, writes decisions to Cassandra, calls the NVCF API to set desired instance counts
helm-reval (Go): validates OCI-referenced Helm charts before the compute plane deploys them
OpenBao (Apache 2.0 Vault fork): all function secrets encrypted at rest, injected at runtime via the ess-agent sidecar
Cassandra: persistent state and distributed locks for the autoscaler
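To make the autoscaler's job concrete, here is a minimal sketch of one pass of a utilization-driven scaling loop. It is not the function-autoscaler's actual algorithm: the proportional rule is borrowed from the Kubernetes Horizontal Pod Autoscaler, and `FunctionMetrics`, `desired_instances`, `scaling_pass`, the target utilization, and the instance cap are all hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass

SCALING_INTERVAL_SECONDS = 30  # loop period mentioned above


@dataclass
class FunctionMetrics:
    """Hypothetical snapshot of one function's state (not an NVCF type)."""
    function_id: str
    current_instances: int
    utilization: float  # average busy fraction across instances, 0.0 to 1.0


def desired_instances(m, target_utilization=0.5, max_instances=8):
    """HPA-style proportional rule: scale so utilization approaches the target."""
    raw = math.ceil(m.current_instances * m.utilization / target_utilization)
    # Clamp to [0, max_instances]; zero utilization yields zero instances.
    return max(0, min(max_instances, raw))


def scaling_pass(metrics, apply_decision):
    """One loop iteration: compute a decision per function and hand it off.

    apply_decision stands in for persisting the decision and calling the
    platform API; here it is any callable taking (function_id, count).
    """
    for m in metrics:
        apply_decision(m.function_id, desired_instances(m))


if __name__ == "__main__":
    snapshot = [
        FunctionMetrics("fn-busy", current_instances=4, utilization=0.75),
        FunctionMetrics("fn-idle", current_instances=3, utilization=0.0),
    ]
    scaling_pass(snapshot, lambda fid, n: print(fid, "->", n))
```

A real loop would call `scaling_pass` every `SCALING_INTERVAL_SECONDS` and would also need a signal other than utilization to leave zero instances, since a function with no pods reports no utilization at all.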
Invocation Plane sits between every caller and every GPU worker. Nothing bypasses it:
http-invocation (Rust / Axum): receives HTTP/gRPC requests, publishes to NATS JetStream, handles async polling
llm-gateway (Go): OpenAI-compatible API with token-aware rate limiting via embedded Olric cache
grpc-proxy (Go): forwards gRPC calls to function instances
ratelimiter (Go): per-function rate limiting using Olric distributed cache
nats-auth-callout (Go): NATS authentication with NKey, OIDC, and webhook strategies
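"Token-aware" rate limiting means a request is charged by how many LLM tokens it consumes rather than counted as one unit. A minimal in-process sketch of that idea, using a classic token bucket, might look like the following. It assumes nothing about the real llm-gateway: the `TokenBucket` class and its parameters are hypothetical, and the actual service keeps its counters in a distributed Olric cache rather than in local memory.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical per-function budget measured in LLM tokens."""
    capacity: float            # largest burst of LLM tokens allowed
    refill_per_second: float   # sustained LLM tokens per second
    tokens: float = 0.0        # budget currently available
    last_refill: float = 0.0   # timestamp of the last refill, in seconds

    def allow(self, cost, now):
        """Charge `cost` LLM tokens at time `now`; return False if over budget."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=1000, refill_per_second=100, tokens=1000)
    print(bucket.allow(800, now=0.0))  # large prompt fits in the initial burst
    print(bucket.allow(300, now=0.0))  # rejected: only 200 tokens remain
    print(bucket.allow(300, now=1.0))  # accepted after one second of refill
```

Passing `now` explicitly keeps the sketch deterministic; a production limiter would read a monotonic clock and share state across gateway replicas.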
Compute Plane is one NVCA (NVIDIA Cluster Agent) operator per GPU cluster. NVCA registers the cluster with the control plane, consumes NATS messages, and manages pod lifecycle.
How a Single Request Flows #
Every invocation follows this path, which I verified from the source code:
Caller sends POST /v2/nvcf/pexec/functions/{id}
http-invocation checks rate via ratelimiter gRPC
Request published to NATS stream: Create.NVCA.*.{clusterID}.*.* (from nvca/pkg/queue/nats/client.go)
NVCA queue manager consumes the message
ICMSRequest Kubernetes CR created (deduplication by NATS sequence)
MiniService controller reconciles: creates Pod or applies Helm chart
Function pod connects back via WorkerService gRPC: ConnectOnce
Response returns to the caller
On completion: Terminate.NVCA.{clusterID} triggers pod deletion and GC
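Two mechanics in this path are easy to demonstrate in isolation: how a NATS subject filter such as `Create.NVCA.*.{clusterID}.*.*` selects messages for one cluster, and how deduplicating by stream sequence makes redelivery harmless. The sketch below reimplements standard NATS wildcard semantics (`*` matches exactly one token, `>` matches one or more trailing tokens) in plain Python. It is not NVCA's code: the function and class names are hypothetical, and the token values in the example subjects are placeholders, since the article does not say what each wildcard position carries.

```python
def subject_matches(pattern, subject):
    """Return True if `subject` matches a NATS-style `pattern`."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one token left.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


class SequenceDeduplicator:
    """Remembers stream sequence numbers so a redelivered message is ignored."""

    def __init__(self):
        self._seen = set()

    def first_delivery(self, stream_sequence):
        if stream_sequence in self._seen:
            return False
        self._seen.add(stream_sequence)
        return True


if __name__ == "__main__":
    cluster_filter = "Create.NVCA.*.{}.*.*".format("cluster-a")
    dedup = SequenceDeduplicator()
    deliveries = [
        (41, "Create.NVCA.x.cluster-a.y.z"),
        (41, "Create.NVCA.x.cluster-a.y.z"),  # redelivery of the same message
        (42, "Create.NVCA.x.cluster-b.y.z"),  # addressed to another cluster
    ]
    for seq, subject in deliveries:
        if subject_matches(cluster_filter, subject) and dedup.first_delivery(seq):
            print("create resource for sequence", seq)
```

Only the first delivery of sequence 41 produces work here, which is the property that lets a consumer retry safely.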
Scale-to-Zero: The NATS Buffer Approach #
This is the most important architectural decision in the whole codebase, and it is fundamentally different from how Knative handles scale-to-zero.
With Knative, requests can experience timeout or retry pressure during long scale-up events, especially for GPU workloads with heavy cold starts. That model works well for lightweight stateless HTTP services that initialize quickly. GPU inference workloads are different. Loading large models into VRAM can take tens of seconds or even minutes, making durable request buffering much more important.
NVCF uses NATS JetStream as a durable request buffer:
Autoscaler drives desired instance count to 0. No pods running.
New request arrives. Published to NATS JetStream. Stream persists it durably.
Autoscaler detects queue depth > 0. Sets desired instances to 1+.
NVCA receives creation message, launches pod.
Pod connects via WorkerService gRPC, pulls the buffered message.
Response returns through the still-open...
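The buffering idea can be simulated in a few lines. This toy model uses an in-memory deque where NVCF uses a durable JetStream stream, and a single `autoscaler_tick` method where the real system has a separate autoscaler and cluster agent. `BufferedFunction` and all of its methods are hypothetical names, and the "work" is just uppercasing a string.

```python
from collections import deque


class BufferedFunction:
    """Toy model of one function with a request buffer and scale-to-zero."""

    def __init__(self):
        self.queue = deque()   # stand-in for the durable stream
        self.instances = 0     # pods currently running
        self.responses = {}    # request_id -> result

    def publish(self, request_id, payload):
        """Accept a request even when no instance is running."""
        self.queue.append((request_id, payload))

    def autoscaler_tick(self):
        """Scale up when work is waiting; scale to zero when the queue is empty."""
        if self.queue and self.instances == 0:
            self.instances = 1
        elif not self.queue and self.instances > 0:
            self.instances = 0

    def worker_pull(self):
        """A running instance pulls the oldest buffered request and answers it."""
        if self.instances == 0 or not self.queue:
            return None
        request_id, payload = self.queue.popleft()
        self.responses[request_id] = payload.upper()
        return request_id


if __name__ == "__main__":
    fn = BufferedFunction()
    fn.publish("req-1", "hello")
    print(fn.worker_pull())       # None: no pod yet, the request stays buffered
    fn.autoscaler_tick()          # queue depth > 0, so one instance starts
    print(fn.worker_pull())       # req-1 is served after the cold start
    fn.autoscaler_tick()          # queue drained, back to zero
    print(fn.instances, fn.responses)
```

The point of the model is that the request is never rejected or retried while `instances` is 0; it simply waits in the buffer. Scaling down the moment the queue is empty is a simplification for the sketch, not a claim about NVCF's scale-down policy.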