NVCF Is Now Open Source: Inside NVIDIA's GPU Function Platform | Kubesimplify

NVIDIA just open-sourced the full NVCF platform under Apache 2.0. Not a thin SDK, not a client library. The actual control plane, invocation plane, compute plane, CLIs, Helm charts, and database migrations, all in one monorepo at github.com/nvidia/nvcf.

NVCF powers the infrastructure behind services like build.nvidia.com and NVIDIA-hosted inference workflows across GPU cloud providers and DGX Cloud environments. Now you can run the whole thing yourself and read every line that makes it work.

Let’s break down how the platform actually works.

What NVCF Actually Is #

NVCF stands for NVIDIA Cloud Functions. The original managed service let you register a Docker container or Helm chart, specify a GPU type, and NVIDIA handled everything: routing, queueing, autoscaling, multi-tenant isolation. GPU cloud partners like CoreWeave ran the NVIDIA Cluster Agent on their Kubernetes clusters so their GPUs could serve functions while NVIDIA owned the control plane.

The April 2026 Apache 2.0 release publishes that control plane. The previous repos (NVIDIA/nvidia-cloud-functions, NVIDIA/nvcf-go) are now archived. This monorepo is the one place everything lives.

One honest caveat: the control plane images are currently distributed via NVIDIA's NGC registry under the nvcf-onprem org. You need NGC access to deploy the full stack today. The source code is all Apache 2.0 and inspectable, but the deployable bundle still goes through NGC while issue #12 (full OSS build) is open. I opened issue #14 asking for a community contributor path.

Three-Plane Architecture #

The entire platform is built around three independently scalable planes connected through NATS JetStream.

Control Plane runs on a dedicated Kubernetes cluster and owns function lifecycle, autoscaling decisions, and secrets management. Key services:

function-autoscaler (Rust): runs a 30-second scaling loop, reads utilization from VictoriaMetrics, writes decisions to Cassandra, calls the NVCF API to set desired instance counts

helm-reval (Go): validates OCI-referenced Helm charts before the compute plane deploys them

OpenBao (open-source fork of HashiCorp Vault, licensed MPL 2.0): all function secrets encrypted at rest, injected at runtime via the ess-agent sidecar

Cassandra: persistent state and distributed locks for the autoscaler
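To make the autoscaler's job concrete, here is a minimal sketch of what one iteration of a utilization-driven scaling loop could look like. The real function-autoscaler is written in Rust and I have not reproduced its algorithm here. The proportional rule below is the common "scale by utilization over target" approach (the same idea Kubernetes HPA uses), and the names `desired_instances`, `scaling_loop`, `target`, `min_instances`, and `max_instances` are all assumptions for illustration. Only the 30-second interval comes from the description above.

```python
import math
import time

LOOP_INTERVAL_SECONDS = 30  # from the article; everything else below is assumed


def desired_instances(current, utilization, target=0.7,
                      min_instances=0, max_instances=8):
    """Proportional scaling rule (an assumption, not NVCF's confirmed logic).

    current      -- instances running right now
    utilization  -- average utilization across instances, 0.0 to 1.0
    target       -- utilization the loop tries to hold
    """
    if current == 0:
        # With no instances there is no utilization signal; scaling up
        # from zero needs another signal such as queue depth.
        return min_instances
    wanted = math.ceil(current * utilization / target)
    return max(min_instances, min(wanted, max_instances))


def scaling_loop(read_metrics, set_desired, iterations, sleep=time.sleep):
    """Run the decision `iterations` times, pausing between iterations.

    read_metrics() -> (current, utilization) stands in for a metrics query.
    set_desired(n) stands in for the API call that sets the instance count.
    """
    for i in range(iterations):
        current, utilization = read_metrics()
        set_desired(desired_instances(current, utilization))
        if i < iterations - 1:
            sleep(LOOP_INTERVAL_SECONDS)
```

Passing `sleep` as a parameter keeps the loop testable without waiting 30 seconds per iteration. In the real service the metrics would come from VictoriaMetrics and the decision would also be persisted to Cassandra, which this sketch leaves out.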

Invocation Plane sits between every caller and every GPU worker. Nothing bypasses it:

http-invocation (Rust / Axum): receives HTTP/gRPC requests, publishes to NATS JetStream, handles async polling

llm-gateway (Go): OpenAI-compatible API with token-aware rate limiting via embedded Olric cache

grpc-proxy (Go): forwards gRPC calls to function instances

ratelimiter (Go): per-function rate limiting using Olric distributed cache

nats-auth-callout (Go): NATS authentication with NKey, OIDC, and webhook strategies
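"Token-aware" rate limiting is worth a quick illustration: instead of counting requests, the limiter charges each request by the number of LLM tokens it consumes. The sketch below uses a classic token bucket with an in-memory counter. That choice is my assumption. I have not confirmed which algorithm llm-gateway uses, and the real service keeps its state in Olric rather than in a local object. The class name `TokenBudget` and its parameters are hypothetical.

```python
class TokenBudget:
    """Token bucket charged in LLM tokens instead of request counts.

    A hypothetical, single-process stand-in for a distributed limiter.
    """

    def __init__(self, tokens_per_minute, now=0.0):
        self.capacity = float(tokens_per_minute)
        self.refill_per_second = tokens_per_minute / 60.0
        self.available = self.capacity  # start with a full budget
        self.last = now

    def allow(self, llm_tokens, now):
        """Return True and charge the budget if the request fits."""
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.available = min(self.capacity,
                             self.available + elapsed * self.refill_per_second)
        self.last = now
        if llm_tokens <= self.available:
            self.available -= llm_tokens
            return True
        return False
```

With a budget of 600 tokens per minute, a 500-token request passes, an immediate 200-token follow-up is rejected, and the same follow-up passes ten seconds later once 100 tokens have refilled. The clock is passed in explicitly so the behavior is deterministic.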

Compute Plane is one NVCA (NVIDIA Cluster Agent) operator per GPU cluster. NVCA registers the cluster with the control plane, consumes NATS messages, and manages pod lifecycle.

How a Single Request Flows #

Every invocation follows this path, which I verified from the source code:

Caller sends POST /v2/nvcf/pexec/functions/{id}

http-invocation checks rate via ratelimiter gRPC

Request published to NATS stream: Create.NVCA.*.{clusterID}.*.* (from nvca/pkg/queue/nats/client.go)

NVCA queue manager consumes the message

ICMSRequest Kubernetes CR created (deduplication by NATS sequence)

MiniService controller reconciles: creates Pod or applies Helm chart

Function pod connects back via WorkerService gRPC: ConnectOnce

Response returns to the caller

On completion: Terminate.NVCA.{clusterID} triggers pod deletion and GC
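Two details in the steps above are easy to gloss over: how a cluster picks out only its own messages from a subject like Create.NVCA.*.{clusterID}.*.*, and how redelivered messages are deduplicated by sequence. The sketch below models both. The wildcard rules are standard NATS semantics (`*` matches exactly one token, `>` matches one or more trailing tokens). Everything else is an assumption: `QueueConsumer` is a hypothetical stand-in for the NVCA queue manager, the in-memory set is just one way to dedupe, and the subject token values in the usage example are placeholders because the article does not say what the wildcard positions carry.

```python
def subject_matches(pattern, subject):
    """Match a NATS subject against a pattern with * and > wildcards."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last token and match at least one token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


class QueueConsumer:
    """Hypothetical model of a per-cluster consumer with sequence dedup."""

    def __init__(self, cluster_id):
        self.pattern = f"Create.NVCA.*.{cluster_id}.*.*"
        self.seen_sequences = set()
        self.created = []  # names of the request objects we "created"

    def handle(self, subject, sequence):
        """Return True only when a new request object is created."""
        if not subject_matches(self.pattern, subject):
            return False  # message is for another cluster
        if sequence in self.seen_sequences:
            return False  # redelivery of a message already handled
        self.seen_sequences.add(sequence)
        self.created.append(f"icms-request-{sequence}")
        return True
```

A quick usage example:

```python
consumer = QueueConsumer("cluster-a")
consumer.handle("Create.NVCA.x.cluster-a.y.z", 41)  # True, new request
consumer.handle("Create.NVCA.x.cluster-a.y.z", 41)  # False, duplicate
consumer.handle("Create.NVCA.x.cluster-b.y.z", 42)  # False, other cluster
```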

Scale-to-Zero: The NATS Buffer Approach #

This is the most important architectural decision in the whole codebase, and it is fundamentally different from how Knative handles scale-to-zero.

With Knative, requests can experience timeout or retry pressure during long scale-up events, especially for GPU workloads with heavy cold starts. That model works well for lightweight stateless HTTP services that initialize quickly. GPU inference workloads are different. Loading large models into VRAM can take tens of seconds or even minutes, making durable request buffering much more important.

NVCF uses NATS JetStream as a durable request buffer:

Autoscaler drives desired instance count to 0. No pods running.

New request arrives. Published to NATS JetStream. Stream persists it durably.

Autoscaler detects queue depth > 0. Sets desired instances to 1+.

NVCA receives creation message, launches pod.

Pod connects via WorkerService gRPC, pulls the buffered message.

Response returns through the still-open...
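The buffering sequence above can be modeled in a few lines. This is a toy simulation, not NVCF code: a `deque` stands in for the durable JetStream stream, a single counter stands in for pods, and the "function" just upper-cases its payload. The class `BufferedFunction` and its method names are hypothetical. What the sketch tries to show is the ordering that matters: the request is accepted and held while zero instances exist, and it is served only after the autoscaler reacts to queue depth.

```python
from collections import deque


class BufferedFunction:
    """Toy model of scale-to-zero with a request buffer."""

    def __init__(self):
        self.queue = deque()   # stand-in for the durable stream
        self.instances = 0     # scaled to zero at start
        self.responses = {}

    def invoke(self, request_id, payload):
        """Accept a request even when no instance is running."""
        self.queue.append((request_id, payload))

    def autoscaler_tick(self):
        """Scale up on queue depth > 0, back to zero when idle."""
        if self.queue and self.instances == 0:
            self.instances = 1
        elif not self.queue and self.instances > 0:
            self.instances = 0
        return self.instances

    def worker_drain(self):
        """A running worker pulls buffered messages; returns how many."""
        handled = 0
        while self.instances > 0 and self.queue:
            request_id, payload = self.queue.popleft()
            self.responses[request_id] = payload.upper()
            handled += 1
        return handled
```

Running it through one cycle: after `invoke("r1", "hello")` a call to `worker_drain()` handles nothing because no instance exists, `autoscaler_tick()` then raises the instance count to 1, the next `worker_drain()` serves the buffered request, and a later tick with an empty queue returns the function to zero.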
