Chopped, Stored, Secured — The Story of the Hash Function | Math behind Security
A Python implementation for all of these algorithms is available in my GitHub repository.
Peter Luhn came up with the idea of using math to verify and store information, Birthday problem explains the collisions in Hash tables, Rabin-Karp algorithm uses rolling hash to search strings. We’ve talked so much about hash, yet not a single time was the definition of the hash function itself provided.
When did hashing first appear?
In 1956 the idea of hashing was for the first time defined by Arnold Dumey. His passion for cryptography since 14 and mathematics degree from Columbia University brought him first to the US Army Signal Corps, then to Potter Instrument (printer producing company also interested in engineering of computer memory). Later in his interview for the Charles Babbage Institute Dumey said: “I wrote a paper, which was the first paper on hash coding, based on the work I did there [at Potter Instrument]”. In that paper, Dumey described the concept of using a mathematical transformation to map data to memory addresses for faster search and called it indexing.
How did indexing work?
In his paper, Dumey gave clear instructions.
1. To store a word in memory, first transform it into a number.
Let’s take the word “BOX” as an example. We can convert it, as Dumey suggests, into a base 37 number or just use ASCII. We will go with the first option. To do that, we map letters to their alphabetical positions
A=1<br>B=2<br>...<br>X=24<br>Y=25<br>Z=26
Then we multiply them by powers of 37.
In our example, the numeric value of the word “BOX” is
\[2 \times 37^2 + 15 \times 37^1 + 24 \times 37^0 = 2738 + 555 + 24 = 3317\]
Early computer systems and punched tapes primarily dealt with letters and numbers only: 26 letters of the English alphabet, plus 10 numbers, and a space add up to 37 available symbols.
2. Find a prime number slightly less than the number of addressable locations.
In our example, we will have 100 addressable locations. In this case, the nearest prime number is 97.
3. Divide the numeric value by the prime number…
…then throw away the quotient and use the remainder. In other words, perform our favorite modulo operation!
\[3317 \pmod{97} = 19\]
This means the word “BOX” will be stored at memory address 19.
What’s the difference between indexing and hashing?
There’s none. Later indexing will be called a polynomial hash, which we’ve already covered in Rabin-Karp algorithm a few weeks ago. The word hash itself originates in the French word hacher (“to chop”, “to mince”) which pretty much reflects what a hash function does with information before storing it in memory. For some time it was just a widespread jargon among cryptographers. In the late 1960s, the term hash appeared officially for the first time in Herbert Hellerman’s “Digital Computer System Principles”.
So that’s how cryptographic hashes work?
Not really. The idea is the same: use mathematics to transform data. However, the goal now is to also make that data protected from attackers.
The definition of the cryptographic hash that I'll give here is based on a very interesting paper, "New Directions in Cryptography"(1976) by Whitfield Diffie and Martin E. Hellman. It's engaging, easy to read, and covers not only cryptographic hashing but also RSA, so I can really recommend it for further readings.
So what is a cryptographic hash?
Hash is a value used to secure vulnerable data. For example, websites store your login information as a hash, not as plain text, to protect it from being stolen by an attacker. When you register, you enter the password (PW) for the first time. The website produces a hash by using a one-way function f(x), also called a hash function, and stores f(PW). For each of your further logins, the website will calculate f(x) with the data you enter. Only if f(x) equals f(PW) will the password be accepted.
Since your password can still be stolen on its way from your computer to the website, password hashing is usually used together with encryption protocols like TLS to protect your data in transit.
Why is storing f(PW) safer than storing PW?
The main property of the one-way function that creates a hash is its irreversibility. It’s computationally impossible to go back from hash to plain text even if you have the function used for the transformation.
Computationally means that even with the most powerful computer, it will take ages to find the original value.
Why would it be so hard to reverse a hash function?
At school we learned that every function has an inverse function: addition is reversed by subtraction, multiplication by division, an exponent by a logarithm. However, for some functions, there’s simply no known “undo button”. Such functions are called preimage resistant. Let’s look at some examples.
1. High-degree polynomial
The polynomial of degree n looks like this:
\[p(x) = a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x +...