LLaDA 1.5 Demo

LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

Fengqi Zhu¹*§, Rongzhen Wang¹*, Shen Nie¹, Xiaolu Zhang³, Chunwei Wu³, Jun Hu³, Jun Zhou³, Jianfei Chen², Yankai Lin¹†, Ji-Rong Wen¹, Chongxuan Li¹†‡

¹Renmin University of China, ²Tsinghua University, ³Ant Group

* Equal contribution, § Work done during an internship at Ant Group, † Project leader, ‡ Corresponding author

TL;DR: We propose VRPO to reduce gradient variance and improve preference alignment in masked diffusion language models.

Motivation: The Problem with RL-Based Alignment in Diffusion Language Models

Masked Diffusion Models (MDMs) cannot compute exact log-likelihoods directly. Taking DPO as an example, each log-likelihood must be approximated by an Evidence Lower Bound (ELBO), written \(\mathcal{B}_{\pi}(y)\) below:

\[\mathcal{L}_{\mathrm{DPO-E}}(\theta) = -\mathbb{E}_{(y_w, y_l)} \left[\log \sigma\left(\beta \left(\mathcal{B}_{\pi_\theta}(y_w) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_w)\right) - \beta\left(\mathcal{B}_{\pi_\theta}(y_l) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_l)\right)\right)\right]\]
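As a minimal sketch of the loss above for a single preference pair, the following Python function takes four ELBO estimates that are assumed to have been computed already and returns the per-pair DPO-E loss. The function names and the default `beta=0.1` are illustrative assumptions, not taken from the released code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_e_loss(elbo_theta_w: float, elbo_ref_w: float,
               elbo_theta_l: float, elbo_ref_l: float,
               beta: float = 0.1) -> float:
    """Per-pair DPO-E loss from ELBO estimates (beta default is an assumption).

    elbo_theta_* are ELBO estimates under the policy, elbo_ref_* under the
    reference model; *_w is the preferred response, *_l the dispreferred one.
    """
    score = (beta * (elbo_theta_w - elbo_ref_w)
             - beta * (elbo_theta_l - elbo_ref_l))
    return -log_sigmoid(score)


if __name__ == "__main__":
    # The policy raises the ELBO of y_w and lowers that of y_l relative to the
    # reference, so the loss falls below log(2), its value at a zero score.
    print(dpo_e_loss(-10.0, -12.0, -15.0, -12.0))
```

Averaging this quantity over a batch of pairs gives a Monte Carlo estimate of the expectation over \((y_w, y_l)\).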

Key Challenge: ELBO estimation introduces additional variance through Monte Carlo sampling, which propagates through the nonlinear log-sigmoid function, creating both bias and variance in the loss.
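The bias can be seen from Jensen's inequality: \(-\log\sigma\) is convex, so a zero-mean perturbation of the score can only raise the expected loss. The sketch below illustrates this numerically under a simplifying assumption: the ELBO estimation error is modeled as Gaussian noise added to a fixed true score. The function `noisy_loss_stats` and its parameters are hypothetical and serve only as an illustration.

```python
import math
import random


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def noisy_loss_stats(true_score: float, noise_std: float,
                     n_trials: int = 20000, seed: int = 0):
    """Return (bias, variance) of -log_sigmoid(score) under a noisy score.

    Assumption: the score estimator is unbiased with Gaussian error.
    """
    rng = random.Random(seed)
    losses = [-log_sigmoid(true_score + rng.gauss(0.0, noise_std))
              for _ in range(n_trials)]
    mean = sum(losses) / n_trials
    var = sum((x - mean) ** 2 for x in losses) / (n_trials - 1)
    bias = mean - (-log_sigmoid(true_score))
    return bias, var


if __name__ == "__main__":
    for std in (0.1, 0.5, 2.0):
        bias, var = noisy_loss_stats(true_score=0.5, noise_std=std)
        print(f"noise_std={std}: bias={bias:.4f}, variance={var:.4f}")
```

In this toy setting both the bias and the variance of the loss grow with the noise level of the score.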

VRPO: Three Simple Techniques for Variance Reduction

Core Insight: We prove that both bias and variance can be bounded by the variance of the preference score estimator. Therefore, reducing this variance improves overall optimization.

1️⃣ Increased Budget

Use more samples \(n = n_{\mathrm{time}} \times n_{\mathrm{mask}}\) to estimate each ELBO

2️⃣ Optimal Allocation

Set \(n_{\mathrm{time}} = n\) and \(n_{\mathrm{mask}} = 1\) (one mask per timestep)

3️⃣ Antithetic Sampling

Share timesteps and masks between \(\pi_\theta\) and \(\pi_{\mathrm{ref}}\)
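The three techniques can be sketched together in a toy setting. The code below is an illustration under stated assumptions, not the authors' implementation: each ELBO draw is assumed to be \(1/t\) times the sum of masked-token log-probabilities, timesteps are drawn uniformly from \([10^{-3}, 1]\), and `make_toy_model` is a hypothetical stand-in for a network. The sampling budget is `n_time * n_mask`, the allocation is chosen by calling `sample_noise` with `n_mask=1`, and sharing is done by passing the same `noise` list to both models.

```python
import random
from statistics import variance


def sample_noise(seq_len, n_time, n_mask, rng):
    """Draw n_time timesteps and n_mask masks per timestep (n_time * n_mask draws)."""
    noise = []
    for _ in range(n_time):
        t = rng.uniform(1e-3, 1.0)  # assumed range; avoids division by zero
        for _ in range(n_mask):
            # Each token is masked independently with probability t.
            mask = [rng.random() < t for _ in range(seq_len)]
            noise.append((t, mask))
    return noise


def elbo_estimate(token_logprob, seq_len, noise):
    """Average (1/t) * sum of masked-token log-probs over all draws (assumed form)."""
    total = 0.0
    for t, mask in noise:
        masked_sum = sum(token_logprob(i, mask)
                         for i in range(seq_len) if mask[i])
        total += masked_sum / t
    return total / len(noise)


def make_toy_model(offset):
    """Hypothetical model: log-probs get worse as more of the context is masked."""
    def token_logprob(i, mask):
        return -1.0 - offset - 0.5 * sum(mask) / len(mask)
    return token_logprob


def score_variance(shared, n_trials=300, seq_len=16, n=8, seed=0):
    """Variance of the estimated ELBO difference between policy and reference."""
    rng = random.Random(seed)
    policy, ref = make_toy_model(0.0), make_toy_model(0.2)
    diffs = []
    for _ in range(n_trials):
        # Allocation from technique 2: n timesteps, one mask each.
        noise_policy = sample_noise(seq_len, n_time=n, n_mask=1, rng=rng)
        # Technique 3: reuse the same draws for the reference model.
        noise_ref = (noise_policy if shared
                     else sample_noise(seq_len, n_time=n, n_mask=1, rng=rng))
        diffs.append(elbo_estimate(policy, seq_len, noise_policy)
                     - elbo_estimate(ref, seq_len, noise_ref))
    return variance(diffs)


if __name__ == "__main__":
    print("shared noise:     ", score_variance(shared=True))
    print("independent noise:", score_variance(shared=False))
```

Because the two toy models respond similarly to the same timestep and mask, their estimation errors are positively correlated and largely cancel in the difference when the noise is shared. Sharing also requires no extra draws compared with sampling independently.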

Impact: VRPO improves LLaDA's performance across a broad range of benchmarks. Techniques 2 and 3 improve results at no additional sampling cost.

BibTeX

Please consider citing:

@article{zhu2025llada,
  title={LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models},
  author={Zhu, Fengqi and Wang, Rongzhen and Nie, Shen and Zhang, Xiaolu and Wu, Chunwei and Hu, Jun and Zhou, Jun and Chen, Jianfei and Lin, Yankai and Wen, Ji-Rong and others},
  journal={arXiv preprint arXiv:2505.19223},
  year={2025}
}
