When does fragmentation occur in the CUDA caching allocator?

PyTorch DevLog
Edward Yang (@ezyang) · June 1, 2026 · 12 min read
eager · cuda · memory

Disclosure. This post was drafted by Claude (Anthropic’s coding assistant) with editing from ezyang.

In an ideal world, users of CUDA memory in PyTorch programs could abstract the allocator’s behavior as follows: there is a fixed amount of GPU memory; whenever you allocate, the available memory goes down; and when you free, the available memory goes back up. Unfortunately, the internal implementation of the CUDA caching allocator means that certain allocation patterns give rise to fragmentation, where even though there is “technically” enough free space to store a requested allocation, the allocator is unable to actually serve the request.

There are many modern use cases where users want to use as much of their GPU’s memory as possible while still ensuring that they do not OOM. Users in this situation are often penny-pinching allocations, and find it very surprising when PyTorch reserves more memory than they expect under the abstract model of the allocator. This is especially common in LLM serving, where every megabyte of GPU memory that isn’t nailed down by model weights or CUDA graph buffers can be used for KV cache.

Modern disaggregated serving involves recording a distinct CUDA graph for each batch size. It’s important for these graphs to share the same memory pool. But sharing a pool means the allocator’s internal bookkeeping needs to be correct before each recording, and the way the allocator manages memory (splitting and merging blocks) can go wrong in ways that depend on allocation order.

In this post, we’ll walk through some small laboratory examples where this fragmentation happens, and then demonstrate why expandable segments fix these examples. It’s important to have a mental model of what exactly we mean by “fragmentation”, because some kinds of fragmentation can be solved with expandable segments (especially those related to recording CUDA graphs), while others cannot.

Segments, blocks, and splitting

The caching allocator organizes GPU memory in two levels.
Segments are contiguous regions obtained from CUDA (via cudaMalloc or virtual memory mapping). Blocks are sub-regions within a segment that track individual allocations.

When a request comes in, the allocator finds a free block that’s large enough. If the block is bigger than needed, the allocator splits it: the front portion serves the allocation, and the back portion becomes a new free block. When a block is freed, the allocator tries to merge it with its immediate neighbors, but only if the neighbor is also free. Two free blocks separated by an allocated block cannot merge.

    import gc
    import torch

    MiB = 1024 * 1024

    def alloc(n, mib, pool, dev):
        # Allocate n uint8 tensors of `mib` MiB each from the given pool.
        with torch.cuda.use_mem_pool(pool, dev):
            return [
                torch.empty(int(mib * MiB), dtype=torch.uint8, device=dev)
                for _ in range(n)
            ]

    def free(ts):
        # Drop every tensor reference held by the list.
        ts.clear()

    def layout(pool):
        # Print each segment in the pool with the size and state of its blocks.
        for s in torch.cuda.memory_snapshot(pool.id):
            blocks = " | ".join(f"{b['size']//MiB}M {b['state']}" for b in s["blocks"])
            print(f" seg {s['total_size']//MiB}M: [{blocks}]")

    pool = torch.cuda.MemPool()
    dev = torch.device("cuda:0")

    t = alloc(1, 32, pool, dev)
    layout(pool)  # one 32M block

    free(t)

    ts = alloc(2, 16, pool, dev)
    layout(pool)  # 32M segment split into two 16M blocks

    del ts[0]
    layout(pool)  # first block inactive, second still active; can't merge

    free(ts)
    layout(pool)  # both free and adjacent; merged back to 32M
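The split and merge rules described above can also be mimicked without a GPU. The following is a minimal sketch, not the allocator's actual implementation: `ToySegment`, `Block`, and the first-fit search are hypothetical names and simplifications introduced only for illustration.

```python
from dataclasses import dataclass

MiB = 1024 * 1024


@dataclass
class Block:
    size: int          # size in bytes
    free: bool = True


class ToySegment:
    """Hypothetical model of one segment: an ordered list of adjacent blocks."""

    def __init__(self, size):
        self.blocks = [Block(size)]

    def alloc(self, size):
        # First fit (a simplification): take the first free block that is large enough.
        for i, b in enumerate(self.blocks):
            if b.free and b.size >= size:
                if b.size > size:
                    # Split: the front serves the request, the back stays free.
                    self.blocks.insert(i + 1, Block(b.size - size))
                    b.size = size
                b.free = False
                return b
        return None  # no single free block is large enough

    def free(self, block):
        block.free = True
        i = next(k for k, b in enumerate(self.blocks) if b is block)
        # Merge with the right neighbor, then the left one, only if they are free.
        if i + 1 < len(self.blocks) and self.blocks[i + 1].free:
            block.size += self.blocks.pop(i + 1).size
        if i > 0 and self.blocks[i - 1].free:
            self.blocks[i - 1].size += self.blocks.pop(i).size

    def layout(self):
        return [(b.size // MiB, "free" if b.free else "used") for b in self.blocks]


# Mirror the script above: split 32 MiB into two 16 MiB blocks, then free both.
seg = ToySegment(32 * MiB)
a = seg.alloc(16 * MiB)
b = seg.alloc(16 * MiB)
seg.free(a)
print(seg.layout())  # [(16, 'free'), (16, 'used')]
seg.free(b)
print(seg.layout())  # [(32, 'free')]

# Two free blocks separated by an allocated block cannot merge.
seg2 = ToySegment(32 * MiB)
w, x, y, z = (seg2.alloc(8 * MiB) for _ in range(4))
seg2.free(w)
seg2.free(y)
print(seg2.alloc(16 * MiB))  # None: 16 MiB is free in total, but not contiguously
```

The last case is the fragmentation in miniature: the free total would cover the request, but no single free block does.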

How segments are obtained depends on whether expandable segments are enabled. The behavior is quite different in each case.

Without expandable segments

Run the scripts in this section with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False.

Without expandable segments, each cudaMalloc call creates a separate segment. For allocations between 1 MiB and 10 MiB, the allocator requests a 20 MiB segment regardless of the actual size. For allocations ≥ 10 MiB, the segment size is the request rounded up to the next multiple of 2 MiB.
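The sizing rule can be written as a small function. This is a sketch under stated assumptions, not the allocator's code: `segment_size` is a hypothetical helper, the second threshold is read as "at least 10 MiB", the exact boundary handling is assumed, and the 2 MiB segment for requests of 1 MiB or less is not stated in this post. Actual behavior may differ between PyTorch versions.

```python
MiB = 1024 * 1024

SMALL_LIMIT = 1 * MiB     # requests up to this size are treated as "small"
SMALL_SEGMENT = 2 * MiB   # assumption: not stated in this post
LARGE_LIMIT = 10 * MiB    # at or above this, the request itself is rounded
LARGE_SEGMENT = 20 * MiB  # fixed segment size for mid-sized requests
ROUND = 2 * MiB           # rounding granularity for large requests


def segment_size(request: int) -> int:
    """Size of the segment requested from cudaMalloc for a new allocation (sketch)."""
    if request <= SMALL_LIMIT:
        return SMALL_SEGMENT
    if request < LARGE_LIMIT:
        return LARGE_SEGMENT
    # Round up to the next multiple of 2 MiB.
    return ROUND * ((request + ROUND - 1) // ROUND)


for mib in (4, 10, 16, 33):
    print(f"{mib} MiB request -> {segment_size(mib * MiB) // MiB} MiB segment")
```

Under these assumptions a 4 MiB request creates a 20 MiB segment, so 16 MiB of that segment remains as a free block that later requests can split.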

The key constraint: blocks in different segments can never merge. Each cudaMalloc is an independent allocation from CUDA’s perspective. A free 16 MiB block in one segment cannot combine with a free 16 MiB block in another segment to serve a 32 MiB request.

This is where allocation order matters. Let’s walk through two scenarios step by step.

Small then large (bad order):

    import gc
    import torch

    MiB = 1024 * 1024

    def alloc(n, mib, pool, dev):
        # Allocate n uint8 tensors of `mib` MiB each from the given pool.
        with torch.cuda.use_mem_pool(pool, dev):
            return [
                torch.empty(int(mib * MiB), dtype=torch.uint8, device=dev)
                for _ in range(n)
            ]

    def free(ts):
        # Drop every tensor reference held by the list.
        ts.clear()

    def reserved(pool):
        # Total bytes held in segments belonging to this pool.
        return sum(s["total_size"] for s in torch.cuda.memory_snapshot(pool.id))

    def layout(pool):
        for s in torch.cuda.memory_snapshot(pool.id):
            blocks = " | ".join(f"{b['size']//MiB}M {b['state']}" for b in s["blocks"])
            print(f" seg...
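The key constraint from this section (free blocks in different segments never merge) can be illustrated without a GPU. This is a toy sketch: `total_free` and `can_serve` are hypothetical helpers, and each segment is modeled simply as a list of its free block sizes.

```python
MiB = 1024 * 1024


def total_free(segments):
    # Sum of all free bytes, ignoring which segment they live in.
    return sum(size for seg in segments for size in seg)


def can_serve(segments, request):
    # A request needs ONE free block that is large enough;
    # free blocks in different segments cannot be combined.
    return any(size >= request for seg in segments for size in seg)


# Two segments, each holding one free 16 MiB block.
two_segments = [[16 * MiB], [16 * MiB]]
print(total_free(two_segments) // MiB)       # 32
print(can_serve(two_segments, 32 * MiB))     # False

# The same 32 MiB as a single free block in one segment.
one_segment = [[32 * MiB]]
print(can_serve(one_segment, 32 * MiB))      # True
```

Both layouts have 32 MiB free, but only the second can satisfy a 32 MiB request from what is already reserved.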
