Fixing LLM writing with Distribution Fine Tuning
Abstract/TLDR: LLMs are notoriously formulaic at writing, overusing certain tokens or phrases. Using Maximum Mean Discrepancy (MMD), Judge Model Quality (JMQ), and L2 token distribution distance, I show that models trained with SFT fail to match the distribution of their training data.
To fix this, I created a new training algorithm, Distribution Fine Tuning (DFT), an LLM post-training step that makes the distribution of model outputs better match the training distribution (improving MMD by 49% and JMQ by 63%). The model trained with DFT is much better at writing than an SFT baseline: it improves creativity scores (+164%), coherence (+28%), clarity (+16%), and meaningful detail (+146%), and it does not show any overused "slop signs" like too many emdashes or "it’s not X, it’s Y".
A demo (14B param model) is available at https://dft.rosmine.ai/
Models trained with DFT have a much more human writing style: a sample of 100 model outputs scored as 100% human-written by the Pangram AI detector.
Outline
Key Metrics: Quantifying output quality
Define the key metrics for measuring text quality: MMD, JMQ, and Token L2 distance.
The Problem: SFT is not all you need
Use these metrics to quantify how SFT fails to capture the training data distribution.
Sample Model Outputs
Samples to see how DFT improves output.
Results
Defines the "super baseline" and shows DFT improvement on key metrics.
Next Steps for DFT
Collabs, Open weight model, Large model
Unverified hype/speculation + Limitations
Potential Extensions of DFT, as well as drawbacks
Anti-slop considerations + Future Vision
How I plan to use DFT to reduce slop
Prior Work
Other papers that have quantified failures of SFT and proposed solutions
Appendices
Deeper data dives, including DFT vs. SFT on 6 other metrics, dataset details, token frequency analysis, effect of data size, comparison with other models, fine grained judge model analysis, and quantification of slop signs in DFT output vs. human text.
Key Metrics: Quantifying output quality
Slop. It’s not just annoying — it’s exhausting. You’re absolutely right to be annoyed by it, and in this blog I will delve into a solution.
You’ve probably noticed most models have their favorite words or phrases they overuse, like "—", "it’s not X, it’s Y", or "delve". Before investigating the solution, I first address the metrics I use to measure output quality. Instead of measuring "quality" itself, which is not well defined, I measure similarity to human writing samples.
Metrics:
N-gram token distribution L2 distance: This metric captures word choice similarity, and is useful for detecting overuse of certain words/phrases, like emdashes.
Given a set of writing samples, compute the N-gram token distribution as the number of times each N-gram appears divided by the total number of N-grams, so dimension i measures the frequency of N-gram i. To compare the two distributions, I use L2 (Euclidean) distance. (Footnote 1: metrics like KL or JS divergence do not work well here because there are generally many tokens that appear in the reference but not the output, or vice versa, and these have an outsized contribution to the overall metric.) I primarily focus on L2 distance for 1-grams; see Appendix 3 for L2 on 2-grams and 3-grams.
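To make the definition concrete, here is a minimal sketch of the N-gram L2 distance. It is an illustration, not the code used for the results in this post: whitespace splitting stands in for the real tokenizer, and the function names (`ngram_distribution`, `l2_distance`, `token_l2`) are hypothetical.

```python
import math
from collections import Counter


def ngram_distribution(samples, n=1):
    """Relative frequency of each N-gram across all samples.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    counts = Counter()
    for text in samples:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


def l2_distance(p, q):
    """Euclidean distance between two sparse frequency vectors.

    N-grams missing from one side count as frequency 0, so they add a
    bounded term instead of blowing up the way they would in KL divergence.
    """
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def token_l2(reference_texts, output_texts, n=1):
    """L2 distance between the N-gram distributions of two sample sets."""
    return l2_distance(
        ngram_distribution(reference_texts, n),
        ngram_distribution(output_texts, n),
    )
```

Calling `token_l2(human_samples, model_samples, n=1)` gives the 1-gram distance; passing `n=2` or `n=3` gives the 2-gram and 3-gram variants.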
Maximum Mean Discrepancy (MMD, Gretton): This metric computes an embedding for each text sample, then computes a distance between the two embedding distributions. Because it uses embeddings, it measures content similarity. For example, it captures if LLM outputs are overly generic and don’t go into detail, or if they overuse a certain concept (like goblins).
More specifically, given distributions P and Q, MMD compares the average kernel similarity between samples drawn from the same distribution (the first two terms in the formula) with the average similarity between samples drawn from different distributions (the last term). With a characteristic kernel such as the Gaussian RBF, it is 0 if and only if the two distributions are the same. To compute the similarities, the formula uses an embedding model (Llama-embed-nemotron-8B, Babakhin) and a Gaussian RBF kernel k.
$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x,x' \sim P}[k(x,x')] + \mathbb{E}_{y,y' \sim Q}[k(y,y')] - 2\,\mathbb{E}_{x \sim P,\; y \sim Q}[k(x,y)]$$
I use MMD instead of other embedding-based distances because it was designed to test whether two sets of samples come from the same distribution, which aligns with the primary goal of DFT.
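As a rough illustration of the formula, the sketch below estimates MMD² from two finite sets of embedding vectors. Several things here are assumptions rather than details from this post: it uses the simple biased estimator (self-pairs are included in the same-distribution averages), the kernel bandwidth `sigma` is a free parameter, and the inputs are plain lists of floats standing in for real embedding-model outputs.

```python
import math


def rbf_kernel(u, v, sigma):
    """Gaussian RBF kernel k(u, v) = exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y) with x in xs, y in ys."""
    total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(xs, ys, sigma=1.0):
    """Biased estimate of MMD^2 between two sets of embedding vectors.

    Assumption: sigma=1.0 is a placeholder bandwidth, not a value
    taken from the post.
    """
    return (
        mean_kernel(xs, xs, sigma)          # E[k(x, x')], both from P
        + mean_kernel(ys, ys, sigma)        # E[k(y, y')], both from Q
        - 2.0 * mean_kernel(xs, ys, sigma)  # cross term between P and Q
    )
```

In practice `xs` would hold the embeddings of the human-written samples and `ys` the embeddings of the model outputs. Identical sets give a value of 0, and the value grows as the two sets of embeddings drift apart.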
Judge Model Quality (JMQ): This metric gives a judge model a prompt together with a human-written completion and a model-generated completion. (Footnote 2: the judge is GPT5.4-mini, with prompts in randomized order to prevent positional bias.) Judge Model Quality score (JMQ) is defined as 2 times the win rate for model outputs. (Since the goal is to match human text, the optimal score here is a 50% win rate. I multiply by 2 so that the range is 0-1.0.) For the main body of this post, I focus on...