DiffusionGemma: Discrete diffusion in a large language model

smaddrellmander · 1 pts · 0 comments


Problems around papers, methods and ideas as they break. Newest at the top.
43 problems, 7 sets

DiffusionGemma: discrete diffusion in a large language model

10 Jun 2026

A new entry in the Gemma family from DeepMind, but this time they've dropped left-to-right autoregression in favour of discrete diffusion. Instead of generating one token at a time, the model refines the whole sequence in parallel. They report 1000+ tokens/s on a single H100 and an 18GB quantized footprint, up to 4x the throughput of an autoregressive model of the same size. So far it's not really competitive with the flagship Gemma-4 release from earlier this year, but it's getting close.

| # | Title | Difficulty | Tags |
|---|---|---|---|
| 413 | Uniform-State Forward Diffusion | Medium | Diffusion, DiffusionGemma, ForwardProcess |
| 414 | Entropy-Bounded Selection Step | Hard | Diffusion, DiffusionGemma, Sampling, Entropy |
| 415 | Adaptive Stopping for Diffusion Decoding | Medium | Diffusion, DiffusionGemma, Sampling, EarlyStopping |
| 416 | Annealed Temperature Schedule | Medium | Diffusion, DiffusionGemma, Sampling, Temperature |
| 417 | Block-AR Hybrid Attention Masks | Hard | Diffusion, DiffusionGemma, Attention, Masking |
| 418 | Block-Autoregressive Diffusion Decode | Hard ◆ | Diffusion, DiffusionGemma, Sampling, Inference |
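Problem 413 is titled "Uniform-State Forward Diffusion", so here is a minimal sketch of what a uniform-state corruption step usually looks like in the discrete-diffusion literature. This is an assumption about the general technique, not DiffusionGemma's actual code: the function name `uniform_forward`, the keep-probability `alpha_t` and the toy vocabulary size are all made up for illustration.

```python
import random

def uniform_forward(tokens, alpha_t, vocab_size, rng):
    """Corrupt a token sequence with a uniform-state kernel (illustrative).

    Each position independently keeps its clean token with probability
    alpha_t; otherwise it is resampled uniformly from the vocabulary.
    alpha_t = 1 returns the clean sequence, alpha_t = 0 returns pure noise.
    """
    noisy = []
    for tok in tokens:
        if rng.random() < alpha_t:
            noisy.append(tok)                        # keep the clean token
        else:
            noisy.append(rng.randrange(vocab_size))  # uniform replacement
    return noisy

rng = random.Random(0)                   # fixed seed so runs are repeatable
clean = [5, 17, 42, 7, 99, 3, 11, 64]    # hypothetical token ids
for alpha in (1.0, 0.5, 0.0):
    print(alpha, uniform_forward(clean, alpha, vocab_size=128, rng=rng))
```

Every position is corrupted independently, so the whole sequence can be noised (and later denoised) in parallel, which is the property the speed claim above rests on.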

Nemotron 3: a hybrid Mamba-Transformer MoE

5 Jun 2026

NVIDIA's open Nemotron 3 family: Nano (30B), Super (120B) and Ultra (550B). Instead of a standard transformer, the models are mostly Mamba layers with a handful of attention layers, route their experts through a compressed latent space, and were pretrained end to end in 4-bit.

The hybrid layers keep the KV cache small, which leaves room for a much wider expert layer, while the 4-bit pretraining and the speculative-decoding heads keep the models cheap to train and to serve. The weights are open, and on long outputs Super delivers more than twice the throughput of GPT-OSS-120B.

| # | Title | Difficulty | Tags |
|---|---|---|---|
| 406 | Hybrid KV-Cache Budget | Easy | KVCache, Mamba, GQA, Inference |
| 419 | Mamba-2 Selective Scan | Medium | Mamba, SSM, StateSpace, NumPy |
| 407 | MoE Expert Arithmetic Intensity | Easy | MoE, Roofline, Inference, LatentMoE |
| 408 | LatentMoE Dimensional Trade | Easy | MoE, LatentMoE, Inference |
| 409 | LatentMoE Sigmoid Router | Medium | MoE, LatentMoE, Routing |
| 410 | LatentMoE Forward Pass | Hard | MoE, LatentMoE, Routing, Transformers |
| 411 | NVFP4 Block Quantization | Medium | Quantization, NVFP4, LowPrecision |
| 412 | MTP Draft Heads | Hard | MTP, SpeculativeDecoding, Drafting, Inference |
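To see why swapping attention layers for Mamba layers shrinks the KV cache, a back-of-the-envelope calculation helps. The sketch below uses the standard KV-cache size formula; the layer counts, head counts and context length are hypothetical numbers chosen for round results, not Nemotron 3's real configuration.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-cache size for one sequence.

    Only attention layers store per-token keys and values; the leading
    factor of 2 covers K and V. bytes_per_elem=2 assumes a 16-bit cache.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

SEQ_LEN = 131072  # hypothetical 128k-token context

# Hypothetical 48-layer model: all-attention vs. 6 attention layers + 42 Mamba layers.
dense = kv_cache_bytes(n_attn_layers=48, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
hybrid = kv_cache_bytes(n_attn_layers=6, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

print(f"dense:  {dense / 2**30:.1f} GiB")   # dense:  24.0 GiB
print(f"hybrid: {hybrid / 2**30:.1f} GiB")  # hybrid: 3.0 GiB
```

The cache scales linearly with the number of attention layers, and Mamba layers carry a fixed-size recurrent state that does not grow with sequence length, so in this toy setup the hybrid needs one eighth of the memory per sequence.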

DiffusionBlocks: an end to sequential training?

1 Jun 2026

A training method from Sakana AI that challenges the assumption that Transformers need to be trained end-to-end. They turn the B blocks of the network into B independent diffusion denoisers, with no gradient ever crossing a block boundary.

Because each block is independent, the blocks can be trained one at a time on a single GPU (activation memory O(L) → O(L/B)) or all B at once on separate GPUs with no communication. The usual training bottlenecks (memory growth, sequential dependency, cross-device communication) do not apply.

| # | Title | Difficulty | Tags |
|---|---|---|---|
| 400 | Euler Step From a Residual | Easy | Diffusion, PFODE, ResidualConnection |
| 401 | Equi-Probability Noise Partition | Medium | Diffusion, LogNormal, InverseCDF |
| 402 | AdaLN Noise Conditioning | Medium | Diffusion, LayerNorm, AdaLN, Conditioning |
| 403 | Sample Sigma Within a Block's Range | Medium | Diffusion, LogNormal, InverseCDF, Sampling |
| 404 | EDM-Weighted Denoising Loss | Medium | Diffusion, Loss, EDM, Weighting |
| 405 | Block-Wise Euler Sampling | Medium | Diffusion, Inference, PFODE, DiffusionBlocks |
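Problems 401 and 403 hint at how each block gets its own slice of the noise schedule: split a log-normal distribution over noise levels into B ranges of equal probability mass. The sketch below is a guess at that idea rather than the paper's code. The function name `block_sigma_ranges` is hypothetical, and the defaults (`p_mean`, `p_std`, `sigma_min`, `sigma_max`) are borrowed from the EDM training setup as an assumption.

```python
import math
from statistics import NormalDist

def block_sigma_ranges(num_blocks, p_mean=-1.2, p_std=1.2,
                       sigma_min=0.002, sigma_max=80.0):
    """Split noise levels into num_blocks ranges of roughly equal mass.

    Assumes ln(sigma) ~ Normal(p_mean, p_std^2). Interior edges sit at the
    i/num_blocks quantiles via the inverse CDF; the outer edges are clipped
    to [sigma_min, sigma_max], so the end ranges hold slightly less mass.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    edges = [sigma_min]
    for i in range(1, num_blocks):
        z = std_normal.inv_cdf(i / num_blocks)        # standard-normal quantile
        edges.append(math.exp(p_mean + p_std * z))    # map to a sigma value
    edges.append(sigma_max)
    return list(zip(edges[:-1], edges[1:]))           # (low, high) per block

for idx, (lo, hi) in enumerate(block_sigma_ranges(4)):
    print(f"block {idx}: sigma in [{lo:.4f}, {hi:.4f}]")
```

With ranges like these, each block would only ever see noise levels from its own interval during training, which is what would let the B blocks be optimised without exchanging gradients.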

DeepSeek-V4: an efficiency release

30 Apr 2026

Two preview MoE models, V4-Pro (1.6T total, 49B active) and V4-Flash (284B total, 13B active), that replace V3's attention, rewrite the residual stream, and switch optimisers. All three changes serve the same goal: making million-token context cheap and efficient.

V4-Flash matches V3.2-Base on most benchmarks at a third of the active parameters. V4-Pro at 1M context uses 10% of V3.2's KV cache and 27% of its inference FLOPs. Capability is behind the frontier, but cost-to-serve at long context drops by roughly a factor of ten.

| # | Title | Difficulty | Tags |
|---|---|---|---|
| 360 | Sinkhorn-Knopp projection onto the Birkhoff polytope | Medium | doubly-stochastic, mhc, deepseek-v4 |
| 361 | Newton-Schulz orthogonalisation | Medium | muon, orthogonalisation, deepseek-v4 |
| 369 | Manifold-constrained Hyper-Connection forward | Hard | residual, hyper-connections, deepseek-v4 |
| 370 | Lightning Indexer top-k selection | Medium | sparse-attention, top-k, deepseek-v4 |
| 371 | Heavily Compressed Attention (HCA) forward | Medium | sparse-attention, kv-compression, deepseek-v4 |
| 372 | Compressed Sparse Attention (CSA) compression step | Hard | sparse-attention, kv-compression, deepseek-v4 |
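Problem 360 points at Sinkhorn-Knopp, the classic way to turn a positive square matrix into a doubly stochastic one (a point in the Birkhoff polytope). Below is a minimal pure-Python sketch of the textbook iteration. How exactly V4 applies it inside the residual stream is not spelled out here, so treat the function name `sinkhorn_knopp` and the example matrix as illustrative assumptions.

```python
def sinkhorn_knopp(matrix, n_iters=100):
    """Alternately normalise rows and columns of a positive square matrix.

    For strictly positive entries the iterates converge to a doubly
    stochastic matrix: every row and every column sums to 1.
    """
    m = [row[:] for row in matrix]   # work on a copy, leave the input intact
    n = len(m)
    for _ in range(n_iters):
        for row in m:                # make each row sum to 1
            total = sum(row)
            for j in range(n):
                row[j] /= total
        for j in range(n):           # make each column sum to 1
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m

raw = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 10.0]]             # hypothetical positive mixing weights
ds = sinkhorn_knopp(raw)
print([round(sum(row), 6) for row in ds])                             # row sums
print([round(sum(ds[i][j] for i in range(3)), 6) for j in range(3)])  # column sums
```

A doubly stochastic mixing matrix can only average its inputs, never amplify them, which is presumably why it is attractive as a constraint on how residual streams are combined.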

TriAttention: KV Cache Compression via Q/K Concentration

11 Apr 2026

A training-free method for scoring key importance in the KV cache. It uses a geometric property of attention heads (pre-RoPE query and key vectors cluster around stable non-zero directions) instead of running full attention.

TriAttention (arXiv 2604.04921) was released April 6 2026. The core observation, Q/K concentration, is...

