Ben Kallus's cool website
Ben Kallus
Weird Design Choices in HPACK Strings
Background
HPACK is the format used to represent header fields in HTTP/2.<br>HPACK uses its own string encoding, consisting of a boolean flag to indicate whether the string is compressed, an unsigned integer length ℓ, and ℓ bytes of string data.
For compressed strings, the decompression works like this:
Begin with a bit string S (the data to decompress) and a list of codewords (provided in the HPACK RFC). Each codeword corresponds to a byte value.
If S consists of a string of between zero and seven 1 bits, we're done (needed for byte-alignment).
If S.starts_with(c) for some codeword c, output that codeword's associated byte value, pull those bits off the front of S, and goto 2.
Otherwise, S is invalid. Fail.
For example, decoding the bit string S = "10111111000011110001110000011111" would go like this:
S starts with "1011111", which is the codeword for the ASCII byte "D", so emit "D", and set S = "1000011110001110000011111"
S now starts with "100001", which is the codeword for the ASCII byte "A", so emit "A", and set S = "1110001110000011111"
S now starts with "1110001", which is the codeword for the ASCII byte "V", so emit "V", and set S = "110000011111"
S now starts with "1100000", which is the codeword for the ASCII byte "E", so emit "E", and set S = "11111"
S is now just five 1 bits, so we're done! The decoded string is "DAVE" (this guy)
Weird Choice #0: The EOS Codeword
Remember when I said this?
> Each codeword corresponds to a byte value.
I was not telling the truth!<br>In reality, the HPACK RFC specifies one extra codeword called "EOS" (end-of-string) that doesn't correspond to a byte value.
You might think this would be used to terminate strings, but it's not.<br>HPACK strings are length-prefixed, so there's no need for a terminator.<br>Instead, this codeword is just invalid.<br>It's invalid to send it, and it's invalid to receive it.
So why is it in the spec at all?<br>Well, remember when I said this?
> If S consists of a string of between zero and seven 1 bits, we're done
This is not the way that the RFC explains it.<br>The RFC says this:
> As the Huffman-encoded data doesn't always end at an octet boundary,<br>> some padding is inserted after it, up to the next octet boundary. To<br>> prevent this padding from being misinterpreted as part of the string<br>> literal, the most significant bits of the code corresponding to the<br>> EOS (end-of-string) symbol are used.
The RFC's explanation is equivalent to my explanation, because the EOS codeword is just thirty 1 bits.<br>I have no idea why they chose to do it this way.<br>The stated reason of "to prevent this padding from being misinterpreted as part of the string" doesn't make any sense, because there are plenty of *valid* codewords that begin with seven 1 bits (e.g., the codeword for "!", which is 1111111000).<br>In my opinion, a simpler and equally effective choice would be to allow padding with any incomplete codeword.
Weird Choice #1: Using One Set of Codewords for Names and Values
HPACK strings are used (exclusively, afaict) to encode HTTP/2 header names and values.<br>These are not identically distributed, so using different codes for header names and values would result in better compression.
To definitively prove this, I would need a representative sample of HTTP headers.<br>Ideally, I could use the same sample that was used to produce the original HPACK Huffman code.<br>Unfortunately, this is all that the RFC says about the source of the Huffman code:
> This Huffman code was generated from statistics obtained on a large<br>> sample of HTTP headers.
I emailed the RFC authors and asked if they had any more information.<br>The one who responded said that he doesn't exactly remember where the code came from, but it might have been Chrome telemetry.<br>If that's the case, there's a 0% chance that I'm getting access :(
Weird Choice #2: Having Codewords for Every Byte Value
The bytes 0x00, 0x0a, 0x0d are disallowed within header values, so we could remove those codewords from the Huffman code for header values.<br>This might give a slightly better compression ratio, but it would have the much larger benefit that any compressed HPACK string would be guaranteed not to contain those disallowed bytes, which have caused problems in the past.
For header names, it's even better: only 51 byte values are valid within an HTTP/2 header name (ASCII lowercase, ASCII numerals, and 15 ASCII punctuation marks/other symbols).<br>That means a much more efficient Huffman code with only 51 codewords could be used to encode header names instead.<br>This would almost certainly lead to significantly better compression, and again guarantee that disallowed bytes won't appear in decompressed header names.