Playing with the language modeling abilities of gzip

robinpie1 pts0 comments

Playing with the language modeling abilities of gzip

Playing with the language modeling abilities of gzip

With thanks to Nathan Barry's Can gzip be a language model?

Note: I imprecisely say "gzip", "zlib", and "DEFLATE" interchangeably here. Sorry.

I'm fascinated by systems that can work in more than one way by effectively doing the same task backwards. For example, you can run a microphone backwards to create a (terrible) speaker, and you can run a speaker backwards to create a (actually pretty neat for bassy sounds) microphone; a speaker takes electricity and correspondingly pushes a coil through a magnetic field to vibrate air, and a microphone takes air pushing a coil through a magnetic field and generates electricity.

Similarly, compression and prediction are the same task. You can think of the best possible compression as "the most compact way to represent a given dataset", and the best possible predictor as "something that predicts a given dataset given the least information". These are the exact same tasks, just phrased differently. Of course, the obvious question is: how much generativity can we squeeze out of traditional compressors?

Language Modeling Is Compression touched on the generative abilities of gzip, but it's not the focus of the paper, and it's a pretty degenerate case; they only look one step ahead at a time, which is unfair to gzip.

Context Text (1948 Bytes)

[...] Moreover, some non-party life peers (the number being determined by the Prime Minister) are nominated by an independent House of Lords Appointments Commission. If an hereditary peer also holds a life peerage, he or

gzip Samples (100 Bytes)

– (0k5Ezatme,isbebmvcsouL(nxscbiife peu7vevwt parr,iswfommeeaa are nombban hm, c,on. , pncmm.sexg uam<br>– Suasa8g thformp0iufoof Lo e7vkoasaeka w8viiufoounb,xbepe,deto.,5mdrSu r,teepe,rgesgS,be.dcyh2vLnary<br>– CxOsic,*auEfOlnknm } eaa0oplutfpq(afcnuChanm,areovervr LoventiL.myehm;nrhvnywsaO7seeg Apo,arelyehm;.

Not good. DEFLATE works by back-references to substrings, and a back-reference needs a minimum match of three bytes to trigger. Decoding one byte at a time, you only use the plain Huffman costs. The entire point of gzip is saving bytes by representing repeated substrings of several bytes as something simpler.

Nathan Barry's gzipt (blog post linked above) expands on this excellently. gzipt harnesses gzip's brains in a much better way: it runs a beam search over bytes. gzipt keeps a beam of several candidate continuations alive at once and looks a sizable horizon ahead (24 bytes here, beam width 32) before committing, allowing long substrings to demonstrate good compressibility on a long enough timespan for gzip to really work its magic. If you set gzipt to beam width 1 and horizon 1, you get the degenerate case in Language Modeling Is Compression. With larger horizons and beam widths, you get visible structure:

MENENIUS:

'Though all at once canq

MARCIUS:

Pray now, nocamest thou to a morsel .

LARTIUS:

Hence, and

I' the end admire, where G

again; and after it ag .

LARTIUS:

Hence, and

I' the end ad

LARTIUS:

fame and

There's also a neat trick in zlib that makes this affordable: you can clone a compressor's state:

base = zlib.compressobj(level)<br>base.compress(context) # do the expensive work once<br>for cand in candidates:<br>clone = base.copy() # fork the encoder state<br>cost = len(clone.compress(cand) + clone.flush()) # marginal cost of cand

$ time python3 gzipt.py --corpus data/tinyshakespeare.txt --prompt $'MENENIUS:\n' --length 200 1>/dev/null

real 0m12.826s<br>user 1m8.207s<br>sys 0m2.413s

Very cool. What can we do with it?

Does it have to be gzip?

For text generation, yes.

Of course, my first thought was "we have better than DEFLATE nowadays, can we swap out the brain"? I added an --algo flag and tried five:

algooutput

zlibrecognizable Shakespeare-ish text<br>bz2ScdMENIURy3dywyywyyywyy...<br>zstdI tell you,; e. ...eee e a<br>brotli' \n \n '...<br>lzmaalmost entirely newlines (and it took an hour to do it!)

Everything except zlib quickly collapses into repeating one cheap byte. Why? Here's how the compressed length changes when you add a byte:

algocompress(a)compress(aa)

zlib910<br>bz23737<br>lzma6060

This is good compression, but gzipt's implementation of the beam search can't tell apart candidates, so it breaks ties towards some globally cheap byte. Essentially, we're having quantization issues; the information difference the beam is trying to look at is smaller than a byte, the smallest unit we're measuring in.

(on this toy input, zstd and brotli do move, but on a real 32KB context the one-byte difference rounds away into the whole-stream total and collapses too)

The only reason zlib works here also happens to be the clone trick. We measure each candidate's marginal cost directly, avoiding the whole-stream rounding. The others, lacking an API that allows us to clone the compressor state, have to recompress the entire context + candidate every single time, and that's why small...

gzip bytes beam language byte gzipt

Related Articles