Can gzip be a language model?
A while back I wrote about language modeling without neural networks, where I generated Shakespeare with an unbounded n-gram model: no weights, no training, just counting. Fortuitously, I came across the paper Language Modeling is Compression, which mentioned the compression–prediction equivalence :<br>every prediction model is inherently a compressor, and all compression algorithms are prediction models .
This led to the natural question: can gzip do language modeling?1<br>No neural network, no learned parameters, nothing. Just the compressor that ships with your operating system. You prime it with a corpus, give it a normal text prompt, and it continues that prompt by searching for the byte sequences that compress best. Here’s some real, unedited output after priming it on tiny Shakespeare:<br>gzipt --corpus data/tinyshakespeare.txt --prompt $'MENENIUS:\n' --length 200<br>MENENIUS:<br>'Though all at once canq
MARCIUS:<br>Pray now, nocamest thou to a morsel .
LARTIUS:<br>Hence, and<br>I' the end admire, where G<br>again; and after it ag .It turns out, kind of? It’s not exactly coherent text, but it clearly knows something about the text. Much more than I expected gzip to know.2 So how can a compressor generate this?<br>Compression is prediction#<br>Think about what a compressor does. It spends few bytes on data it “expects” and many bytes on data it doesn’t. If I hand you a file that’s the letter A repeated a million times, you can describe it in one sentence. A million random bytes, on the other hand, have no structure to exploit and barely compress at all.<br>This is not a coincidence; it’s the core of information theory. The number of bits needed to encode a symbol is $-\log_2 p$, where $p$ is the probability the model assigns to it. High probability means few bits. So any compressor has a probability model hiding inside it, whether or not anyone wrote one down.<br>gzip uses DEFLATE, which compresses the next bytes by finding matches against the recent text in a 32 KiB sliding window. If a continuation echoes something already in the window, DEFLATE encodes it as a cheap back-reference instead of literal bytes. So:<br>A continuation that gzip “expected”, because it echoes text already in its window, compresses to almost nothing.
That gives us a score. If I have some context and I want to know how good a candidate continuation is, I just measure:<br>$$\text{score}(\text{candidate}) = \texttt{len(gzip(context + candidate))}$$The smaller the compressed length, the more “predicted” the candidate is. To prime the model, I include a corpus in gzip’s window. Any continuation that looks like the corpus compresses small, and any continuation that doesn’t compresses large.<br>Generating by beam search#<br>Scoring is one thing; generating is another. The naive approach of picking the single next byte that compresses best fails badly, and for a subtle reason: gzip only gives an integer byte length (no fractions). Adding one byte often doesn’t change the compressed length at all, so many candidates tie and the signal is buried in quantization noise.<br>The fix is to look ahead a whole span before committing. gzipt runs a beam search over byte sequences. At each step, the current context is:<br>corpus window + recent tail of (prompt + generated bytes)<br>Then gzipt tries possible next bytes. Each candidate continuation is scored by compressing context + candidate and checking how many bytes the compressed result takes.<br>The loop is:<br>Prompt. Start with the user’s prompt as the initial text to continue. There is no start token; the prompt bytes are just part of the context gzip sees.<br>Context. Show gzip the corpus window plus the recent tail of the prompt/generated text.<br>Search. Keep the beam_width most-compressible partial continuations. Extend each by every byte that occurs in the corpus, score all of them by compressed length, and prune back down to the best beam_width. Repeat for horizon bytes.<br>Commit. Take the most-compressible full span (or sample among the finalists if temperature is positive), append it, and start the loop over.<br>One detail that matters is that only the last tail bytes of generated output stay in the scoring context. DEFLATE codes nearby matches more cheaply than far ones, so if gzip could see its entire history, the cheapest thing to do is often to fall into verbatim loops, repeatedly copying text it just emitted.
You can see the decoding and scoring process in the animation above, which is the same replay shown at the top. The whole thing is one file of pure standard-library Python (just zlib). Code’s on GitHub if you want to play with it.<br>The paper did try this, but it ended up performing poorly. Adding beam search significantly improved generation quality (an idea they mentioned), which is discussed below. ↩︎
The code actually uses zlib instead of spawning a gzip process, but the name GziPT was too good. I believe they...