Predictive Data Debugging: Reveal and Shape What Models Learn Before You Train

Predictive Data Debugging: Reveal and Shape What Your Model Learns, Before You Train

We introduce predictive data debugging: given a preference dataset, we can accurately predict which behaviors RL will amplify or suppress before you train, trace them back to the responsible data, and reshape the dataset and/or training process to prevent undesired effects.

Authors

Leon Bergen*,1
Daniel Balsam1
Usha Bhalla*,1
Owen Lewis1
Sidharth Baskaran*,1
Jack Merullo1
Max Loeffler*,1
Thomas McGrath1
Raphael Sarfati1
Ekdeep Singh Lubana*,1
Dhruvil Gala1
Ryan Panwar1
Santiago Aranguri1
Thomas Fel1
Atticus Geiger1
Matthew Kowal1
Siddharth Boppana1

* Core contributors
1 Goodfire

Blog post by Tom McGrath, Tucker Fross, and Michael Byun

Published: June 11, 2026

Full paper: Read on arXiv →

Your model is what you put into it: data sets the ceiling on what it can achieve, and everything downstream — architecture, hyperparameters, more compute — just decides how close to that ceiling you get. In a sense, your data is 'programming' your model. But unlike a classical program, the instructions implied by a preference dataset cannot be naively inspected, understood, and debugged: data work is messy, hard, and mostly trial and error. You collect preference data, run DPO, eval the result, and then try to reverse-engineer what went right and wrong from a handful of aggregate scores. When an eval regresses, you're left guessing which of your 260,000 preference pairs did it. We can do better:

Given a preference dataset, we can predict which behaviors DPO will amplify or suppress before you train. This prediction holds up at R² = 0.9 against what the model actually learns, and each predicted behavior can be traced back to the data responsible for it. Armed with that information, we can reshape the dataset and/or training process to prevent undesired effects of post-training on that data.
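For readers who want a concrete handle on the R² figure, here is a minimal sketch of how a coefficient of determination between predicted and observed behavior shifts could be computed. This is an illustration under stated assumptions, not the paper's evaluation code: the function name `r_squared`, the per-behavior "shift" lists, and the numbers in them are all hypothetical, and the paper may define its R² differently (for example as a squared correlation).

```python
def r_squared(predicted, observed):
    """Coefficient of determination of `predicted` against `observed`."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and the same length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: how far the predictions are from what was observed.
    ss_res = sum((o - p) ** 2 for p, o in zip(predicted, observed))
    # Total sum of squares: how much the observations vary around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical per-behavior shifts (illustrative numbers, NOT results from the paper).
predicted_shift = [0.8, -0.3, 0.5, 0.0, -0.6]
observed_shift = [0.7, -0.2, 0.6, 0.1, -0.7]
score = r_squared(predicted_shift, observed_shift)
print(f"R^2 = {score:.3f}")
```

A value near 1 means the pre-training prediction accounts for most of the variation in what the trained model actually does; a value near 0 means it does no better than guessing the average shift.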

Today we're releasing new research on using interpretability to understand and reshape the learning signal in post-training: Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal. We're building these data shaping techniques into Silico, our platform for intentional model design. If you train models and want to see your datasets through your model's eyes, sign up for early access.

Contents

The problem: learning the right things from data
How do we predict what training will do?
Why predict, when I could just train and run my evals?
Case studies: unwelcome surprises in post-training
Case study 1: Your "alignment" data is breaking your safety guardrails
Case study 2: Hallucinated links
Case study 3: Physics sycophancy
Case study 4: Fart fishing??
Validation: Goblin mode
Our vision for data interpretability
What's next on the roadmap?
Stop guessing what your data is teaching your model

The problem: learning the right things from data

Post-training is where most of a model's behavior gets shaped, and that usually means compressing a rich, messy set of goals into a single scalar signal. That scalar encodes what you wanted, but it also encodes whatever correlates with what you wanted: longer answers, more emojis, more sycophancy, compliance in the wrong places, hallucinated links, goblins, being genuinely honest about what's quietly load-bearing. Anyone involved in serious model training has a basically endless list of war stories like these.

How do we predict what training will do?

The key insight behind this work is that interpreting the model also allows us to interpret the data. By passing a dataset through an interpreted model (prior to training on that dataset) we get a powerful new way to look at that dataset in terms of the concepts that the model computes when processing each datum. Those concepts are the prediction: they're what the model will move toward, or away from, if you train on this data.

Embedding-based clustering bundles all aspects of your data into a single vector, whereas looking at your data through an interpreted model allows you to disentangle different things that the model will learn. Grounding the analysis in concepts the model actually represents lets us understand what the model will learn from that data, which gives our approach an important advantage over simply running an LLM over your dataset: the LLM has to guess what will be learned, whereas we can measure it directly.
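To make the idea of reading a dataset concept by concept more tangible, here is a toy sketch. It assumes each preference pair has already been passed through an interpreted model and reduced to per-concept activation scores for the chosen and rejected responses. Everything in it is an assumption made for illustration: the data layout, the concept names, the activation values, and the use of a mean chosen-minus-rejected gap as the "prediction" are a simplification of ours, not the method described in the paper.

```python
from statistics import mean


def predicted_concept_shift(pairs):
    """Mean chosen-minus-rejected activation per concept.

    Positive values suggest the concept would be amplified by training on
    these pairs, negative values that it would be suppressed (toy heuristic).
    """
    concepts = set()
    for pair in pairs:
        concepts.update(pair["chosen"], pair["rejected"])
    return {
        c: mean(p["chosen"].get(c, 0.0) - p["rejected"].get(c, 0.0) for p in pairs)
        for c in concepts
    }


def top_contributors(pairs, concept, k=2):
    """Indices of the k pairs that push hardest toward amplifying `concept`."""
    gaps = [
        (p["chosen"].get(concept, 0.0) - p["rejected"].get(concept, 0.0), i)
        for i, p in enumerate(pairs)
    ]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return [i for _, i in gaps[:k]]


# Hypothetical activations for three preference pairs (made-up numbers).
pairs = [
    {"chosen": {"sycophancy": 0.9, "hallucinated_link": 0.0},
     "rejected": {"sycophancy": 0.1, "hallucinated_link": 0.0}},
    {"chosen": {"sycophancy": 0.2, "hallucinated_link": 0.1},
     "rejected": {"sycophancy": 0.3, "hallucinated_link": 0.6}},
    {"chosen": {"sycophancy": 0.7, "hallucinated_link": 0.0},
     "rejected": {"sycophancy": 0.2, "hallucinated_link": 0.2}},
]

shift = predicted_concept_shift(pairs)
culprits = top_contributors(pairs, "sycophancy", k=2)
```

In this toy data the chosen responses are more sycophantic on average, so the sketch flags sycophancy as likely to be amplified and points at the two pairs driving it. A single embedding vector per example would fold both concepts together, which is the contrast the paragraph above draws.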

Why predict, when I could just train and run my evals?

Preference data routinely teaches models lessons that no one intended, and in the best case you find out about them downstream, from rollouts, after you've already trained. Even worse, you might find out in production! Our new approach tells you what will happen, hands you the specific clusters of examples responsible, and surfaces behaviors you'd never have written an eval for (keep reading for the fish). More importantly, the same...
