DiffusionBlocks: Training Neural Networks One Block at a Time


tl;dr

We introduce DiffusionBlocks, a principled framework that partitions a residual network into blocks and trains each one independently by reinterpreting block-wise updates as the reverse process of a diffusion model. Training stores activations for only one block at a time, reducing memory by a factor of B (the number of blocks) while matching the performance of end-to-end backpropagation across vision, image-generation, and language-modeling architectures.

Makoto Shing, Sakana AI

Masanori Koyama, The University of Tokyo

Takuya Akiba, Sakana AI

May 2026


Standard neural network training optimizes all parameters jointly, so the memory it requires grows with the model size. Block-wise training instead trains each block of the network independently, using memory for just one block at a time. DiffusionBlocks, our method, is a principled instantiation of this idea via a diffusion framework. This work was presented at ICLR 2026.

Summary

Modern AI shows remarkable performance across everyday tasks, math, coding, and more. Today's frontier models typically have hundreds of billions of parameters or more and require thousands of GPUs to train, and only a small number of organizations have the resources to develop them.

One major reason for these resource demands lies in the training method itself. Today's neural networks, including Transformers, are trained via end-to-end optimization, where all parameters are learned jointly. Because the entire network must be processed together during training, the memory required grows with model size. Ideally, we would like to train models of any size regardless of the available hardware. This is exactly our motivation: to make AI development more accessible.

To this end, we propose DiffusionBlocks, a method that divides a network into multiple blocks and trains only one block at a time, with each block learned independently of the others. As a result, training requires memory for only a single block rather than the entire network. At the same time, performance remains competitive with end-to-end optimization. We achieve this by building on the diffusion framework, which has seen tremendous success in recent years. Our experiments demonstrate this across architectures spanning image classification, image generation, and text generation.

This is joint work with Masanori Koyama (The University of Tokyo) and was presented at ICLR 2026, a top international conference in machine learning. The full paper is available below:

Paper: https://arxiv.org/abs/2506.14202

OpenReview: https://openreview.net/forum?id=pwVSmK71cS

Background: The Memory Bottleneck in Deep Learning

Modern AI has been driven by scaling: the empirical observation that performance improves as model size and training data grow. But this scaling carries a real cost in memory. As models grow, the memory required to train them grows in step, and this has become one of the central bottlenecks for who can carry out large-scale AI research.

To see why memory grows with the size of the network, consider how networks are trained. They are typically trained with end-to-end backpropagation, which requires keeping all intermediate states (activations) across the network in memory. As a result, memory consumption grows linearly with the depth of the network. Since modern Transformers are scaled in large part by adding more layers, depth directly drives the growing memory cost.
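The linear growth described above can be illustrated with a back-of-the-envelope estimate. This is a minimal sketch, not a measurement: the function name `activation_memory_bytes`, the tensor shape, and the assumption that each layer stores exactly one activation tensor of size batch × sequence × hidden in 2-byte precision are all illustrative choices, and real Transformers keep several tensors per layer.

```python
def activation_memory_bytes(num_layers, batch_size, seq_len, hidden_dim,
                            bytes_per_value=2):
    """Rough bytes needed to keep one activation tensor per layer for backprop.

    Assumption (illustrative only): each layer stores a single tensor of
    shape (batch_size, seq_len, hidden_dim).
    """
    per_layer = batch_size * seq_len * hidden_dim * bytes_per_value
    return num_layers * per_layer


# Doubling the depth doubles the activation memory.
for depth in (12, 24, 48):
    gib = activation_memory_bytes(depth, batch_size=8, seq_len=2048,
                                  hidden_dim=4096) / 2**30
    print(f"{depth} layers: {gib:.1f} GiB")
# 12 layers: 1.5 GiB
# 24 layers: 3.0 GiB
# 48 layers: 6.0 GiB
```

The absolute numbers depend entirely on the assumed shape; the point is only that the total scales in proportion to `num_layers`.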

Figure: Standard training. A forward pass runs from INPUT to OUTPUT through all layers (six shown), followed by backpropagation through the same layers in reverse.

Standard training backpropagates through all layers at once, so memory grows with the network's depth.

Block-wise Training in Practice

One natural solution to this problem is block-wise training. If we partition the network into smaller blocks and train each independently, the memory required during training drops to that of just one block at a time.
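To make the memory argument concrete, the sketch below splits a stack of layers into blocks and counts how many layers' activations must be held at once. It is only an illustration of the bookkeeping: the helper names `partition_into_blocks` and `peak_layers_in_memory` are hypothetical, and the choice of contiguous, near-equal-size blocks is an assumption rather than a description of how any particular method chooses its partition.

```python
def partition_into_blocks(num_layers, num_blocks):
    """Split layer indices 0..num_layers-1 into contiguous, near-equal blocks.

    Assumption (illustrative only): blocks are contiguous and as equal in
    size as possible; earlier blocks absorb any remainder.
    """
    base, extra = divmod(num_layers, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def peak_layers_in_memory(blocks):
    """When blocks are trained one at a time, only the largest block's
    activations ever need to be stored simultaneously."""
    return max(len(block) for block in blocks)


blocks = partition_into_blocks(num_layers=12, num_blocks=4)
print(blocks)
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
print(peak_layers_in_memory(blocks))
# 3
```

With 12 layers and 4 blocks, at most 3 layers' activations are stored at a time, compared with all 12 under end-to-end backpropagation.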

A line of prior work has explored this direction. Most demonstrations have been on image classification, where block-wise training has typically lagged behind end-to-end training. Beyond classification, extending block-wise training to the modern Transformer architectures and generative tasks that drive much of today's AI remains under-explored.

Prior block-wise training methods have suffered from two fundamental limitations:

1. Weak theoretical grounding: The local...
