Fixing LLM Writing with Distribution Fine Tuning

Abstract/TLDR: LLMs are notoriously formulaic at writing, overusing certain tokens or phrases. Using three metrics, Maximum Mean Discrepancy (MMD), Judge Model Quality (JMQ), and L2 token distribution distance, I show that models trained with SFT fail to match the distribution of their training data.

To fix this, I created a new training algorithm, Distribution Fine Tuning (DFT): an LLM post-training step that makes the distribution of model outputs better match the training distribution (improving MMD by 49% and JMQ by 63%). The model trained with DFT writes much better than an SFT baseline, improving creativity scores by 164%, coherence by 28%, clarity by 16%, and meaningful detail by 146%. It also shows none of the overused "slop signs", such as too many em dashes or "it’s not X, it’s Y".

A demo (14B param model) is available at https://dft.rosmine.ai/

Models trained with DFT have a much more human writing style: a sample of 100 model outputs scored as 100% human-written by the Pangram AI detector.

Outline

Key Metrics: Quantifying output quality

Define the key metrics for measuring text quality: MMD, JMQ, and Token L2 distance.

The Problem: SFT is not all you need

Use these metrics to quantify how SFT fails to capture the training data distribution.

Sample Model Outputs

Samples to see how DFT improves output.

Results

Defines the "super baseline" and shows DFT improvement on key metrics.

Next Steps for DFT

Collabs, Open weight model, Large model

Unverified hype/speculation + Limitations

Potential Extensions of DFT, as well as drawbacks

Anti-slop considerations + Future Vision

How I plan to use DFT to reduce slop

Prior Work

Other papers that have quantified failures of SFT and proposed solutions

Appendices

Deeper data dives, including DFT vs. SFT on 6 other metrics, dataset details, token frequency analysis, effect of data size, comparison with other models, fine grained judge model analysis, and quantification of slop signs in DFT output vs. human text.

Key Metrics: Quantifying output quality

Slop. It’s not just annoying — it’s exhausting. You’re absolutely right to be annoyed by it, and in this blog I will delve into a solution.

You’ve probably noticed most models have their favorite words or phrases they overuse, like "—", "it’s not X, it’s Y", or "delve". Before investigating the solution, I first address the metrics I use to measure output quality. Instead of measuring "quality" itself, which is not well defined, I measure similarity to human writing samples.

Metrics:

N-gram token distribution L2 distance: This metric captures word choice similarity, and is useful for detecting overuse of certain words or phrases, like em dashes.

Given a set of writing samples, compute the N-gram token distribution as the number of times each N-gram appears divided by the total number of N-grams, so dimension i measures the frequency of N-gram i. To compare two distributions, I use L2 (Euclidean) distance. (Note: metrics like KL or JS divergence do not work well here, because there are generally many tokens that appear in the reference but not in the output, or vice versa, and these make an outsized contribution to the overall metric.) I primarily focus on L2 distance for 1-grams; see Appendix 3 for L2 on 2-grams and 3-grams.
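The N-gram computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the results in this post: the function names `ngram_distribution` and `l2_distance` are made up for this example, and the toy samples use pre-split words where a real run would use the model tokenizer's token IDs.

```python
import math
from collections import Counter

def ngram_distribution(samples, n=1):
    """Relative frequency of each n-gram over a list of tokenized samples."""
    counts = Counter()
    for tokens in samples:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}

def l2_distance(p, q):
    """Euclidean distance between two sparse distributions (missing keys count as 0)."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

# Toy data (assumption: whitespace-split words stand in for real tokens).
human = [["the", "cat", "sat"], ["the", "dog", "ran"]]
model = [["the", "cat", "sat"], ["the", "cat", "sat"]]

human_dist = ngram_distribution(human, n=1)
model_dist = ngram_distribution(model, n=1)
distance = l2_distance(human_dist, model_dist)
print(distance)
```

Because missing N-grams simply count as frequency 0, an N-gram that appears in only one of the two sets contributes a bounded amount to the distance, which is the property that makes L2 easier to work with here than KL or JS divergence.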

Maximum Mean Discrepancy (MMD, Gretton): This metric computes an embedding for each text sample, then computes a distance between the two embedding distributions. Because it uses embeddings, it measures content similarity. For example, it captures whether LLM outputs are overly generic and don’t go into detail, or whether they overuse a certain concept (like goblins).

More specifically, given distributions P and Q, MMD compares the average kernel similarity between samples from the same distribution (the first two terms in the formula) with the average kernel similarity between samples from different distributions (the last term). With a characteristic kernel such as the Gaussian RBF, it is 0 if and only if the two distributions are the same. To compute the similarities, the formula uses an embedding model (Llama-embed-nemotron-8B, Babakhin) and a Gaussian RBF kernel k.

$$
\begin{aligned}
\mathrm{MMD}^2(P, Q) &= \mathbb{E}_{x,x' \sim P}[k(x,x')] + \mathbb{E}_{y,y' \sim Q}[k(y,y')] \\
&\quad - 2\,\mathbb{E}_{x \sim P,\; y \sim Q}[k(x,y)]
\end{aligned}
$$

I use MMD rather than other embedding-based distances because it was designed to test whether two sets of samples come from the same distribution, which aligns with the primary goal of DFT.
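To make the MMD formula concrete, here is a minimal NumPy sketch of an empirical estimate. Several things here are assumptions for illustration: random Gaussian vectors stand in for the real text embeddings, the bandwidth `sigma=4.0` is arbitrary (this post does not state the bandwidth used), and the simple biased estimator (which includes the i = j pairs) is used for brevity.

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

def mmd2(x, y, sigma):
    """Biased estimate of MMD^2: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (
        rbf_kernel(x, x, sigma).mean()
        + rbf_kernel(y, y, sigma).mean()
        - 2.0 * rbf_kernel(x, y, sigma).mean()
    )

# Stand-in embeddings (assumption: real inputs would be text embeddings).
rng = np.random.default_rng(0)
human_emb = rng.normal(0.0, 1.0, size=(200, 8))
same_emb = rng.normal(0.0, 1.0, size=(200, 8))     # same distribution as human_emb
shifted_emb = rng.normal(1.0, 1.0, size=(200, 8))  # mean-shifted distribution

near = mmd2(human_emb, same_emb, sigma=4.0)
far = mmd2(human_emb, shifted_emb, sigma=4.0)
print(near, far)
```

Two sample sets drawn from the same distribution give a value close to 0, while the mean-shifted set gives a clearly larger value. With finite samples the estimate is never exactly 0 for two different sample sets, so scores are best read relative to a baseline.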

Judge Model Quality (JMQ): This metric gives a judge model a prompt along with a human-written completion and a model-written completion. (Note: the judge is GPT5.4-mini, with prompts in randomized order to prevent positional bias.) The Judge Model Quality score (JMQ) is defined as 2 times the win rate for model outputs. Since the goal is to match human text, the optimal result here is a 50% win rate; I multiply by 2 so that the range is 0-1.0. For the main body of this post, I focus on...
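As a rough illustration of the JMQ arithmetic, the sketch below computes 2 times the model win rate over a set of comparisons. Everything in it is hypothetical: `judge` is a stand-in callable (the actual judge is an LLM), the order randomization is one guess at how positional bias might be handled, and the toy `longer_judge` exists only so the example runs. The sketch returns the raw 2 × win rate and does not address how win rates above 50% are treated.

```python
import random

def jmq(judge, pairs, seed=0):
    """Return 2 * (model win rate).

    pairs: list of (prompt, human_text, model_text).
    judge(prompt, a, b): returns "A" or "B" for the preferred completion.
    """
    rng = random.Random(seed)
    wins = 0
    for prompt, human_text, model_text in pairs:
        # Randomize which slot the model output occupies (assumed bias control).
        model_first = rng.random() < 0.5
        a, b = (model_text, human_text) if model_first else (human_text, model_text)
        choice = judge(prompt, a, b)
        if (choice == "A") == model_first:
            wins += 1
    return 2.0 * wins / len(pairs)

def longer_judge(prompt, a, b):
    """Toy judge for illustration only: prefers the longer completion."""
    return "A" if len(a) > len(b) else "B"

pairs = [
    ("p1", "a long human answer", "short"),
    ("p2", "another long human answer", "tiny"),
    ("p3", "yet another human answer", "brief"),
    ("p4", "hi", "a much longer model answer"),
]
score = jmq(longer_judge, pairs)
print(score)
```

Here the model output wins 1 of 4 comparisons, so the win rate is 25% and the score is 0.5; a model whose outputs are indistinguishable from human text would sit near a 50% win rate and a score of 1.0.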
