PyTorch 2.12 Release



PyTorch 2.12 Release Blog

By PyTorch Foundation | May 13, 2026 | May 19th, 2026


We are excited to announce the release of PyTorch® 2.12 (release notes)!

The PyTorch 2.12 release features the following changes:

Batched linalg.eigh on CUDA is up to 100x faster due to updated cuSolver backend selection

New torch.accelerator.Graph API unifies graph capture and replay across CUDA, XPU, and out-of-tree backends

torch.export.save now supports Microscaling (MX) quantization formats, enabling full export of aggressively compressed models

Adagrad now supports fused=True, joining Adam, AdamW, and SGD with a single-kernel optimizer implementation

torch.cond control flow can now be captured and replayed inside CUDA Graphs

ROCm users gain expandable memory segments, rocSHMEM symmetric memory collectives, and FlexAttention pipelining

This release is composed of 2,926 commits from 457 contributors since PyTorch 2.11. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try out these features and report any issues as we improve 2.12. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Have questions? Join us on Wednesday, May 20 at 10 am PST for a live Q&A with panelists Joe Spisak, Andrey Talman, and Alban Desmaison, and moderator Chris Gottbrath. We will provide a brief overview of the release and answer your questions live. Register now.

Throughout the 2.x series, PyTorch has been evolving from a research-first framework into a unified, hardware-agnostic platform for production training and inference at scale. PyTorch 2.10 laid the groundwork with cross-backend performance primitives and the formal deprecation of TorchScript. PyTorch 2.11 expanded that foundation with differentiable collectives for distributed training, FlashAttention-4 on next-generation GPUs, and broader export coverage.

PyTorch 2.12 continues this direction: a new device-agnostic torch.accelerator.Graph API unifies graph capture and replay across CUDA, XPU, and out-of-tree backends; batched eigenvalue decomposition is up to 100x faster; and torch.export now supports Microscaling quantization formats for deploying aggressively compressed models. Across these releases, PyTorch is becoming faster across backends and usable on a wider variety of platforms as it continues to enable AI innovation.

Performance Features

Up to 100x faster batched eigendecomposition on CUDA (linalg.eigh)

The backend selection for linalg.eigh on CUDA has been overhauled. The legacy MAGMA backend was deprecated in favor of cuSolver (PR #174619 by Grayson Derossi), and the cuSolver dispatch heuristics were updated to use syevj_batched unconditionally (PR #175403 by Johannes Z). For batched symmetric/Hermitian eigenvalue problems, this yields up to 100x speedups over the previous release, resolving longstanding performance gaps with CuPy.

Workloads which previously took minutes (because PyTorch was inefficiently dispatching each matrix solve individually) now run in seconds by using cuSolver’s syevj_batched kernel, which is designed to process many small/medium matrices as a single GPU operation. These gains are especially relevant for scientific computing and machine learning workloads that rely on eigendecompositions of batched matrices. (example usage in the doc)
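
As a rough illustration of the kind of workload this affects, here is a minimal sketch of a batched call to `torch.linalg.eigh`. The batch size and matrix size are arbitrary values chosen for the example, not figures from the release notes, and the script falls back to CPU when no GPU is present, in which case the cuSolver path described above is not exercised.

```python
import torch

# Use the GPU when present; the batched cuSolver path applies to CUDA only.
device = "cuda" if torch.cuda.is_available() else "cpu"

torch.manual_seed(0)
batch, n = 256, 16  # illustrative sizes, not taken from the release notes

# Build a batch of symmetric matrices: A = (M + M^T) / 2.
m = torch.randn(batch, n, n, device=device, dtype=torch.float64)
a = 0.5 * (m + m.transpose(-2, -1))

# A single call decomposes every matrix in the batch.
eigenvalues, eigenvectors = torch.linalg.eigh(a)

# Sanity check: rebuild A = V diag(w) V^T and measure the worst-case error.
scaled = eigenvectors * eigenvalues.unsqueeze(-2)  # scales column j of V by w_j
reconstructed = scaled @ eigenvectors.transpose(-2, -1)
max_error = (reconstructed - a).abs().max().item()

print(eigenvalues.shape, eigenvectors.shape, max_error)
```

The point of the sketch is the call shape: the leading dimension is the batch, so no Python loop over individual matrices is needed.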

Fused Adagrad optimizer

The Adagrad optimizer now supports fused=True, performing the entire optimizer step in a single CUDA kernel rather than launching separate kernels for each operation. This reduces kernel launch overhead and memory traffic. Adagrad joins Adam, AdamW, and SGD in offering a fused variant. The underlying CUDA kernel was contributed by @MeetThePatel in the 2.11 cycle (PR #159008), and the Python frontend that exposes it to users was finalized by Jane Xu in 2.12 (PR #177672).
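
A minimal sketch of opting in to the fused path follows. The tiny model, learning rate, and random data are placeholders for illustration. The fallback branch rests on an assumption: that a build without fused Adagrad support for the current device rejects the flag when the optimizer is constructed.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

model = torch.nn.Linear(8, 1).to(device)
before = [p.detach().clone() for p in model.parameters()]

try:
    # fused=True requests the single-kernel implementation of the step.
    optimizer = torch.optim.Adagrad(model.parameters(), lr=0.1, fused=True)
    used_fused = True
except (RuntimeError, TypeError, ValueError):
    # Assumption: unsupported builds/devices raise at construction time,
    # so fall back to the default (non-fused) implementation.
    optimizer = torch.optim.Adagrad(model.parameters(), lr=0.1)
    used_fused = False

# One ordinary training step on random placeholder data.
x = torch.randn(32, 8, device=device)
y = torch.randn(32, 1, device=device)
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

changed = any(
    not torch.equal(b, p.detach()) for b, p in zip(before, model.parameters())
)
print(f"fused={used_fused}, parameters updated: {changed}")
```

Nothing else in the training loop changes; the flag only selects which implementation of `step()` runs.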

Compilation and export across hardware

torch.accelerator.Graph: Device Agnostic Accelerator Graph Capture and Stream API

`torch.accelerator.Graph` is a new device-agnostic API for graph capture and replay, providing a unified abstraction over backend-specific implementations such as `torch.xpu.XPUGraph`. Each backend can register its own implementation through a lightweight GraphImplInterface, preserving backend autonomy while enabling a consistent user-facing API.

Alongside this, `c10::Stream` and `torch.Stream` now expose an `is_capturing()` method, replacing the device-specific `is_current_stream_capturing` with a backend-agnostic alternative. Stream context manager reentrance was also fixed. Together, these changes bring cross-backend parity to stream and graph management, with initial support for the XPU backend and extensibility to out-of-tree backends via `PrivateUse1`.
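
To make the registration idea concrete, here is a toy sketch of the general pattern in plain Python: backends register an implementation, and a single front-end class delegates to whichever one matches the device. It is not PyTorch's code and does not use the `torch.accelerator.Graph` API. Every name in it (`GraphImpl`, `register_graph_impl`, `ToyBackendGraph`, the `"toy"` device type) is hypothetical, and the toy backend only remembers a callable rather than recording device work.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional, Type


class GraphImpl(ABC):
    """Hypothetical stand-in for a per-backend graph implementation."""

    @abstractmethod
    def capture(self, fn: Callable[[], None]) -> None: ...

    @abstractmethod
    def replay(self) -> None: ...


# Maps a device type string to the backend's implementation class.
_REGISTRY: Dict[str, Type[GraphImpl]] = {}


def register_graph_impl(device_type: str, impl: Type[GraphImpl]) -> None:
    _REGISTRY[device_type] = impl


class Graph:
    """Device-agnostic front end that delegates to the registered backend."""

    def __init__(self, device_type: str) -> None:
        if device_type not in _REGISTRY:
            raise RuntimeError(f"no graph implementation for {device_type!r}")
        self._impl = _REGISTRY[device_type]()

    def capture(self, fn: Callable[[], None]) -> None:
        self._impl.capture(fn)

    def replay(self) -> None:
        self._impl.replay()


class ToyBackendGraph(GraphImpl):
    """Toy backend: 'capture' stores the callable, 'replay' calls it."""

    def __init__(self) -> None:
        self._fn: Optional[Callable[[], None]] = None

    def capture(self, fn: Callable[[], None]) -> None:
        self._fn = fn

    def replay(self) -> None:
        if self._fn is None:
            raise RuntimeError("replay() called before capture()")
        self._fn()


register_graph_impl("toy", ToyBackendGraph)

log: List[str] = []
graph = Graph("toy")
graph.capture(lambda: log.append("step"))
graph.replay()
graph.replay()
print(log)  # ['step', 'step']
```

User code only ever touches `Graph`, while each backend owns its implementation class. That is the separation the paragraph above describes.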

Contributed by Guangye Yu (Intel) across six PRs, anchored by the C++ interface (PR #171269) and Python frontend (PR #171285). (usage example in...
