NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates | NVIDIA Technical Blog
May 26, 2026
By Jonathan Bentz
NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in C++ enables high-level, tile-based kernel development that automatically manages complex low-level GPU details for optimal performance and portability. CUDA Tile programming is also now supported on Compute Capability 9.0 (NVIDIA Hopper) GPUs, in addition to all other supported GPU architectures.
We are also releasing CUDA Python 1.0, solidifying the support and stability of the CUDA Python software ecosystem and introducing critical features such as green contexts and process checkpointing.
For performance enthusiasts, the newly launched NVIDIA CompileIQ compiler auto-tuning framework delivers up to a 15% speedup on critical kernels like GEMM and attention. This release also features official C++23 support in NVCC, expanded tensor interoperability with DLPack/mdspan in CCCL 3.3, and numerous updates to the math libraries (cuBLAS, cuSPARSE, cuSOLVER) and profiling tools (Nsight Compute and Nsight Systems).
Release of CUDA Tile C++
With the release of CUDA 13.3, CUDA Tile support is extended to C++, enabling the large existing C++ codebase and developer base to create highly optimized GPU tile kernels. This model automates parallelism, memory movement, asynchrony, and other low-level details, resulting in C++ code that is portable across NVIDIA GPU architectures. For more information, check out our blog post.
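To make the idea of tile-based decomposition concrete, here is a minimal sketch in plain Python. It is not CUDA Tile C++ syntax and does not use any CUDA API; the function name `tiled_matmul` and the `tile` parameter are illustrative assumptions. It only shows how a matrix multiply can be expressed as work on small blocks (tiles), which is the kind of structure a tile programming model can map onto hardware on the developer's behalf.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply matrices a (n x k) and b (k x m) one tile at a time.

    Illustrative only: the loops over i0/j0/k0 walk tiles, and the inner
    loops do the work inside one tile. On a GPU, a tile model would decide
    how tiles map to threads and memory; here everything runs serially.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile rows of the output
        for j0 in range(0, m, tile):      # tile columns of the output
            for k0 in range(0, k, tile):  # tiles along the reduction axis
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(tiled_matmul(a, b))
```

The result does not depend on the tile size; only the order in which partial sums are accumulated changes.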
Release of CUDA Python 1.0
CUDA Python is a set of libraries that expose CUDA to the Python programming language. With the 1.0 release, we are committing to semantic versioning: breaking API changes occur only in major-version releases, minor releases add features, and patch releases contain bug fixes. Any public API scheduled for removal is first deprecated in a minor release with a clear replacement path.
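The versioning policy above can be sketched as a small check that classifies an upgrade. This is a minimal illustration in plain Python, assuming simple `MAJOR.MINOR.PATCH` version strings; the helper names `parse_version`, `upgrade_kind`, and `may_break_api` are hypothetical and not part of CUDA Python.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a 3-tuple of ints (missing parts become 0)."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def upgrade_kind(installed, candidate):
    """Classify an upgrade as 'major', 'minor', 'patch', or 'none'."""
    old, new = parse_version(installed), parse_version(candidate)
    if new <= old:
        return "none"   # not an upgrade
    if new[0] != old[0]:
        return "major"  # breaking API changes are allowed here
    if new[1] != old[1]:
        return "minor"  # new features and deprecations, no removals
    return "patch"      # bug fixes only


def may_break_api(installed, candidate):
    """Under semantic versioning, only a major upgrade may break public APIs."""
    return upgrade_kind(installed, candidate) == "major"


if __name__ == "__main__":
    print(upgrade_kind("1.0.0", "1.1.0"), may_break_api("1.0.0", "1.1.0"))
    print(upgrade_kind("1.4.2", "2.0.0"), may_break_api("1.4.2", "2.0.0"))
```

In practice, a project relying on this guarantee might pin a dependency range such as `>=1.0,<2` so that only minor and patch releases are picked up automatically.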
The following is more information on the software components included in CUDA Python 1.0.
| Library | Description | Next major version |
| --- | --- | --- |
| cuda.bindings | Low-level Python bindings to CUDA C APIs | 13.3.0 |
| cuda.core | Pythonic access to CUDA Runtime and other core functionality | 1.0.0 |
| cuda-cccl | Pythonic access to CCCL’s highly efficient and customizable parallel algorithms | 1.0.0 |
| cuda-pathfinder | Utilities for locating CUDA components installed in the user’s Python environment | 1.6 |
cuda.coop is also available in the cuda-cccl package under the _experimental namespace, which is subject to API changes. cuda.coop provides reusable block-wide and warp-wide device primitives for use within Numba CUDA kernels.
cuda.core is now stable
cuda.core provides a Pythonic interface to the CUDA runtime, including devices, streams, programs, linkers, memory resources, and graphs. Version 1.0 consolidates APIs that have been stabilizing over the previous release cycles into a single supported surface. At the same time, we added support for green contexts, CUDA checkpointing, and more.
Green contexts: Split a GPU’s SMs into disjoint partitions, each with its own context and streams, so latency-sensitive kernels are shielded from long-running throughput kernels in the same process.
Process checkpointing: Snapshot the full CUDA state of a running process (including device allocations, streams, and contexts) and restore it later. This unlocks CRIU-style workflows for GPU processes: fault-tolerant long jobs, preemption and migration on shared clusters, and fast warm-start of inference workers. Available only on Linux.
Inter-process sharing (IPC): Share GPU memory across Python processes without copying through the host. One process allocates, and others map the same physical VRAM into their own address space. Ideal for multi-process ML serving and zero-copy producer/consumer pipelines.
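The "one allocates, others map" pattern behind inter-process sharing can be illustrated with a host-memory analogue. The sketch below uses Python's standard `multiprocessing.shared_memory` module, not the cuda.core IPC API, and it shares host RAM rather than GPU memory. The function name `share_buffer_demo` is hypothetical, and the second handle is opened in the same process to keep the example short; in a real pipeline the consumer would attach by name from another process.

```python
from multiprocessing import shared_memory


def share_buffer_demo(payload: bytes) -> bytes:
    """Write payload through one handle and read it back through another.

    Host-memory analogue only: both handles refer to the same underlying
    buffer, so no copy is made between the 'producer' and the 'consumer'.
    """
    # Producer side: allocate a named shared block and fill it.
    producer = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        producer.buf[:len(payload)] = payload

        # Consumer side: attach to the same block by name.
        consumer = shared_memory.SharedMemory(name=producer.name)
        try:
            seen = bytes(consumer.buf[:len(payload)])
        finally:
            consumer.close()
    finally:
        producer.close()
        producer.unlink()  # release the block once nobody needs it
    return seen


if __name__ == "__main__":
    print(share_buffer_demo(b"tensor-bytes"))
```

The GPU case follows the same shape, with device allocations in place of host blocks, which is what avoids the round trip through host memory.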
The following are quick examples of how to use cuda.core APIs.
from cuda.core import Device, Stream, Program, ProgramOptions, LaunchConfig, launch
# pick and activate a GPU
dev = Device()
dev.set_current()

# create a CUDA stream
stream = dev.create_stream()

# NVRTC compile + lookup
prog = Program(src, code_type="c++", options=ProgramOptions(arch=f"sm_{dev.arch}"))
kernel = prog.compile("cubin").get_kernel("my_kernel")

# launch a kernel
launch(stream, LaunchConfig(grid=64, block=256), kernel, *args)

# JIT-LTO linking
from cuda.core import Linker, LinkerOptions
module = Linker(
    [obj1, obj2],
    options=LinkerOptions(arch=f"sm_{dev.arch}")
).link("cubin")

# NVRTC precompiled headers
from cuda.core import ProgramOptions
opts = ProgramOptions(std="c++17",...