DiffusionGemma: Discrete diffusion in a large language model


Problems around papers, methods and ideas as they break. Newest at the top. 43 problems, 7 sets.

DiffusionGemma: discrete diffusion in a large language model

10 Jun 2026

A new entry in the Gemma family from DeepMind, but this time they've dropped left-to-right autoregression for discrete diffusion. Instead of generating one token at a time, it works on the whole sequence in parallel. It reaches 1000+ tokens/s on a single H100, up to 4x the speed of an autoregressive model of the same size, and runs in 18GB when quantized. So far it's not really competitive with the flagship Gemma 4 release earlier this year, but it's getting close.

| # | Title | Difficulty | Tags |
| --- | --- | --- | --- |
| 413 | Uniform-State Forward Diffusion | Medium | Diffusion, DiffusionGemma, ForwardProcess |
| 414 | Entropy-Bounded Selection Step | Hard | Diffusion, DiffusionGemma, Sampling, Entropy |
| 415 | Adaptive Stopping for Diffusion Decoding | Medium | Diffusion, DiffusionGemma, Sampling, EarlyStopping |
| 416 | Annealed Temperature Schedule | Medium | Diffusion, DiffusionGemma, Sampling, Temperature |
| 417 | Block-AR Hybrid Attention Masks | Hard | Diffusion, DiffusionGemma, Attention, Masking |
| 418 | Block-Autoregressive Diffusion Decode | Hard ◆ | Diffusion, DiffusionGemma, Sampling, Inference |
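To make the "whole sequence in parallel" idea concrete, here is a minimal sketch of a uniform-state forward (noising) process, the topic named by problem 413. It is the generic textbook form, not something confirmed as DiffusionGemma's exact process: each token is kept with probability `alpha_t` and otherwise resampled uniformly from the vocabulary. The function name, `alpha_t`, and the vocabulary size of 128 are assumptions for illustration.

```python
import numpy as np

def uniform_forward(tokens, alpha_t, vocab_size, rng):
    """Corrupt every position independently (hypothetical helper).

    Each token is kept with probability alpha_t; otherwise it is replaced
    by a token drawn uniformly from the vocabulary. The draw may happen to
    equal the original token, which matches the uniform-state marginal
    q(x_t | x_0) = alpha_t * onehot(x_0) + (1 - alpha_t) * uniform.
    """
    tokens = np.asarray(tokens)
    keep = rng.random(tokens.shape) < alpha_t
    noise = rng.integers(0, vocab_size, size=tokens.shape)
    return np.where(keep, tokens, noise)

rng = np.random.default_rng(0)
x0 = np.array([5, 17, 42, 8, 99, 3])   # toy token ids, assumed vocab of 128
x_clean = uniform_forward(x0, 1.0, 128, rng)  # alpha_t = 1: nothing corrupted
x_mid = uniform_forward(x0, 0.5, 128, rng)    # roughly half the positions resampled
x_noisy = uniform_forward(x0, 0.0, 128, rng)  # alpha_t = 0: pure uniform noise
```

A denoiser trained against this kind of corruption predicts all positions at once, which is what allows parallel decoding instead of one token per step.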

Nemotron 3: a hybrid Mamba-Transformer MoE

5 Jun 2026

NVIDIA's open Nemotron 3 family: Nano (30B), Super (120B) and Ultra (550B). Instead of a standard transformer they run mostly Mamba with a handful of attention layers, route their experts through a compressed latent space, and were pretrained end to end in 4-bit. The hybrid layers keep the KV cache small, which leaves room for a much wider expert layer, while the 4-bit pretraining and the speculative-decoding heads keep it cheap to train and to serve. The weights are open, and at long outputs Super runs at better than twice the throughput of GPT-OSS-120B.

| # | Title | Difficulty | Tags |
| --- | --- | --- | --- |
| 406 | Hybrid KV-Cache Budget | Easy | KVCache, Mamba, GQA, Inference |
| 419 | Mamba-2 Selective Scan | Medium | Mamba, SSM, StateSpace, NumPy |
| 407 | MoE Expert Arithmetic Intensity | Easy | MoE, Roofline, Inference, LatentMoE |
| 408 | LatentMoE Dimensional Trade | Easy | MoE, LatentMoE, Inference |
| 409 | LatentMoE Sigmoid Router | Medium | MoE, LatentMoE, Routing |
| 410 | LatentMoE Forward Pass | Hard | MoE, LatentMoE, Routing, Transformers |
| 411 | NVFP4 Block Quantization | Medium | Quantization, NVFP4, LowPrecision |
| 412 | MTP Draft Heads | Hard | MTP, SpeculativeDecoding, Drafting, Inference |
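The claim that a hybrid stack keeps the KV cache small comes down to simple arithmetic: only attention layers store keys and values per token, while Mamba layers carry a fixed-size state that does not grow with sequence length. The sketch below illustrates this with made-up layer counts and head sizes. None of the numbers are Nemotron 3's actual configuration, and the fixed Mamba state is ignored.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """KV-cache size in bytes for the attention layers only.

    The leading factor 2 counts one K and one V tensor per attention layer.
    bytes_per_elem=2 assumes a 16-bit cache.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Hypothetical 48-layer model at 128K context, 8 KV heads of width 128.
full = kv_cache_bytes(n_attn_layers=48, n_kv_heads=8, head_dim=128, seq_len=131072)
# Same model with only 6 attention layers; the other 42 are assumed to be Mamba.
hybrid = kv_cache_bytes(n_attn_layers=6, n_kv_heads=8, head_dim=128, seq_len=131072)

print(full / 2**30, hybrid / 2**30)  # cache sizes in GiB
```

Under these assumed numbers the cache shrinks in direct proportion to the number of attention layers kept, which is the memory the write-up says is spent on a wider expert layer instead.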

DiffusionBlocks: an end to sequential training?

1 Jun 2026

A training method from Sakana AI that challenges the assumption that a Transformer needs to be trained end-to-end. They turn the model's B blocks into B independent diffusion denoisers, with no gradient ever crossing a block boundary. Because each block is independent, the blocks can be trained one at a time on a single GPU (activation memory drops from O(L) to O(L/B)) or all B at once on separate GPUs with no communication. The usual training bottlenecks (memory growth, sequential dependency, cross-device communication) do not apply.

| # | Title | Difficulty | Tags |
| --- | --- | --- | --- |
| 400 | Euler Step From a Residual | Easy | Diffusion, PFODE, ResidualConnection |
| 401 | Equi-Probability Noise Partition | Medium | Diffusion, LogNormal, InverseCDF |
| 402 | AdaLN Noise Conditioning | Medium | Diffusion, LayerNorm, AdaLN, Conditioning |
| 403 | Sample Sigma Within a Block's Range | Medium | Diffusion, LogNormal, InverseCDF, Sampling |
| 404 | EDM-Weighted Denoising Loss | Medium | Diffusion, Loss, EDM, Weighting |
| 405 | Block-Wise Euler Sampling | Medium | Diffusion, Inference, PFODE, DiffusionBlocks |
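The O(L) to O(L/B) memory claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes L is the number of layers, that every layer stores the same amount of activation memory for backpropagation, and that L divides evenly by B. The function name and the 512 MiB per-layer figure are hypothetical.

```python
def peak_activation_bytes(n_layers, bytes_per_layer, n_blocks=1):
    """Activation memory held for backprop when one block trains at a time.

    With n_blocks=1 this is ordinary end-to-end training, where activations
    for all layers are kept. With n_blocks=B only L/B layers are live.
    """
    if n_layers % n_blocks != 0:
        raise ValueError("n_layers must divide evenly into n_blocks")
    layers_per_block = n_layers // n_blocks
    return layers_per_block * bytes_per_layer

per_layer = 512 * 2**20  # assumed 512 MiB of stored activations per layer
end_to_end = peak_activation_bytes(32, per_layer)             # B = 1
blockwise = peak_activation_bytes(32, per_layer, n_blocks=4)  # B = 4

print(end_to_end / 2**30, blockwise / 2**30)  # GiB held during backprop
```

The saving applies to activations only. Parameters, gradients and optimiser state for the block being trained still have to fit on the device.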

DeepSeek-V4: an efficiency release

30 Apr 2026

Two preview MoE models, V4-Pro (1.6T total, 49B active) and V4-Flash (284B total, 13B active), that replace V3's attention, rewrite the residual stream, and switch optimisers. All three changes serve the same goal: make million-token context cheap and efficient. V4-Flash matches V3.2-Base on most benchmarks at a third of the active parameters. V4-Pro at 1M context uses 10% of V3.2's KV cache and 27% of its inference FLOPs. Capability is behind the frontier, but cost-to-serve at long context drops by roughly a factor of ten.

| # | Title | Difficulty | Tags |
| --- | --- | --- | --- |
| 360 | Sinkhorn-Knopp projection onto the Birkhoff polytope | Medium | doubly-stochastic, mhc, deepseek-v4 |
| 361 | Newton-Schulz orthogonalisation | Medium | muon, orthogonalisation, deepseek-v4 |
| 369 | Manifold-constrained Hyper-Connection forward | Hard | residual, hyper-connections, deepseek-v4 |
| 370 | Lightning Indexer top-k selection | Medium | sparse-attention, top-k, deepseek-v4 |
| 371 | Heavily Compressed Attention (HCA) forward | Medium | sparse-attention, kv-compression, deepseek-v4 |
| 372 | Compressed Sparse Attention (CSA) compression step | Hard | sparse-attention, kv-compression, deepseek-v4 |
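Problem 360 names Sinkhorn-Knopp projection onto the Birkhoff polytope, the set of doubly stochastic matrices. The summary does not say exactly how V4 applies it to the residual stream, so the following is only a sketch of the standard iteration: alternately normalise the rows and columns of a strictly positive matrix until both sum to one. The function name, the iteration count, and the use of `exp` to obtain positive entries are assumptions.

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=100):
    """Approximately map a square matrix to a doubly stochastic one.

    Exponentiation makes every entry strictly positive, then rows and
    columns are rescaled in turn. Columns are normalised last, so they
    sum to 1 up to rounding and the rows converge towards 1.
    """
    m = np.exp(logits - logits.max())  # subtract the max for numerical stability
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)  # make each row sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # make each column sum to 1
    return m

rng = np.random.default_rng(0)
p = sinkhorn_knopp(rng.normal(size=(4, 4)))
print(p.sum(axis=0), p.sum(axis=1))  # both close to a vector of ones
```

A doubly stochastic mixing matrix preserves the total of whatever it mixes across streams, which is one plausible reason to constrain residual mixing this way, though the summary above does not spell out the motivation.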

TriAttention: KV Cache Compression via Q/K Concentration

11 Apr 2026

A training-free method for scoring key importance in the KV cache. Instead of running full attention, it uses a geometric property of attention heads: pre-RoPE query and key vectors cluster around stable non-zero directions. TriAttention (arXiv 2604.04921) was released on April 6, 2026. The core observation, Q/K concentration, is...
