Pythia 1.4B reproduces 3.6% of training samples verbatim given 950-token prompts

AI Data Extraction Lab 1: When Models Remember Too Much

Neural networks are trained to learn patterns, but sometimes they learn a bit too well. Under the right conditions, ML models can memorize specific samples from their training data and spit them back out. This has real consequences for privacy, intellectual property, and legal compliance:

- GitHub Copilot has been caught reproducing GPL-licensed code without attribution, leading to a class-action lawsuit.
- The New York Times sued OpenAI over ChatGPT reproducing entire passages of its articles.

Academic research in this space has also shown the scale of the problem:

- Carlini et al. showed that GPT-2 can be prompted into regurgitating verbatim training data, including Personally Identifiable Information (PII).
- Nasr et al. extracted over 10,000 training examples from ChatGPT for under $200.

This is the first of a series of labs on data extraction attacks. My goal with these labs is to explore what’s currently possible in the realm of targeted and untargeted data extraction from large models, using resources within everyone’s reach. The main focus will be on LLMs, but I might touch on other types of models if time allows. If you suspect a model was trained on some proprietary data and want to check whether it ended up memorizing it, I am planning to cover a range of techniques, from basic to advanced.

Models can “memorize”

The goal of this first lab is simple: building an intuition that models, under the right settings, can “memorize” samples from their training data.

But what does memorization even mean?

At their core, neural networks are statistical models designed to capture data distributions. In this context, what does it mean for a model to “memorize” some data?

- Carlini et al. define memorization in terms of extractability: a string s is considered memorized if you can find a prompt p such that the model reproduces s verbatim when given p as input.
- Memorization is also commonly measured via membership inference: given a sample, can you determine whether it was part of the model’s training data? (e.g. Mireshghallah et al.)
- Zhang et al. introduced the concept of counterfactual memorization: how much would a model’s predictions change if a specific document were removed from the training data?

In practice, there are multiple ways to look at memorization. They are complementary, and each is relevant under a different threat model.

Storing data

But how much data can a model actually store? Morris et al. found that GPT-style architectures have a capacity of approximately 3.6 bits per parameter. To put that in perspective, a 1-billion parameter model could in principle store around 450 MB of raw data (3.6 billion bits divided by 8 bits per byte). The best text compression algorithms can get close to a ~10x compression ratio, so if a model stores information as efficiently as a good compressor, a 1-billion parameter LLM could in principle retain the equivalent of ~4.5 GB of uncompressed text.

Training sets can be orders of magnitude larger than a model’s storage capacity (e.g. The Pile’s size is > 800 GB). Clearly, models are generally unable to store the entirety of their training data verbatim, and memorization must be limited to a small subset of samples. Interestingly, Maini et al. show that memorization often isn’t spread uniformly across a network, but is concentrated in a small set of neurons.

Carlini et al. found that memorization grows logarithmically with model size. In their experimental setting they observed that:

a ten fold increase in model size corresponds to an increase in memorization of 19 percentage points
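
The two quantitative claims in this section can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope illustration: the function names are hypothetical, megabytes and gigabytes are taken as decimal units (10^6 and 10^9 bytes), and the logarithmic rule simply restates the 19-points-per-ten-fold figure from the text, so it should not be read as valid outside the experimental setting it came from.

```python
import math

# Figures taken from the text; everything else here is an illustrative assumption.
BITS_PER_PARAM = 3.6      # capacity estimate attributed to Morris et al.
COMPRESSION_RATIO = 10    # "~10x" ratio of a strong text compressor
POINTS_PER_DECADE = 19    # percentage points per ten-fold increase in model size


def raw_capacity_bytes(n_params, bits_per_param=BITS_PER_PARAM):
    """Raw storage capacity in bytes: parameters * bits per parameter / 8."""
    return n_params * bits_per_param / 8


def text_equivalent_bytes(n_params, compression_ratio=COMPRESSION_RATIO):
    """Uncompressed text the model could retain if it stored data
    as efficiently as a compressor with the given ratio."""
    return raw_capacity_bytes(n_params) * compression_ratio


def memorization_gain_points(size_ratio, points_per_decade=POINTS_PER_DECADE):
    """Increase in memorization (percentage points) when the model grows by
    `size_ratio`, assuming a purely logarithmic relationship."""
    return points_per_decade * math.log10(size_ratio)


if __name__ == "__main__":
    n_params = 1_000_000_000   # a 1-billion parameter model
    pile_bytes = 800e9         # lower bound on The Pile's size quoted in the text

    raw = raw_capacity_bytes(n_params)
    text = text_equivalent_bytes(n_params)
    print(f"raw capacity:    {raw / 1e6:.0f} MB")
    print(f"text equivalent: {text / 1e9:.1f} GB")
    print(f"share of an 800 GB corpus: {text / pile_bytes:.2%}")
    print(f"gain for a 10x larger model: {memorization_gain_points(10):.0f} points")
```

Under these assumptions, even the optimistic compressed estimate covers well under 1% of an 800 GB corpus, which is the gap the paragraph above points to.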

They also observed a similar relationship with respect to the number of times a sample is duplicated in the training data.

There’s also a connection between the structure of the data and how easily it gets memorized. Recent work on data compressibility and memorization shows that more compressible (i.e. more structured, more predictable) data is easier for models to memorize. However, if we change our frame of reference to counterfactual memorization, this relationship changes: models seem to memorize samples of “intermediate simplicity” better than easy ones.

So, the answer to our initial question, “how much data can a model actually store?”, is “it’s complicated”.

Overfitting & model size

Memorization is often associated with overfitting. While the two do not always go together, overfitting can certainly play a role. To get a sense of this, let’s start by playing with some toy models.

The widget below lets you train a simple neural network. The goal of the network is to classify points in space into one of two classes. Ideally, we’d want the classifier to generalize the distribution and ignore outliers. Try training the model using a large hidden width (e.g. 20), and then a very small width (e.g. 2). You should observe how larger models tend to overfit the training data more easily, clearly storing within their “memory” the approximate location of outliers. Similarly,...
