Geospatial Indexing Explained: A Comparison of Geohash, S2, and H3 | Call me BenHome<br>Writings<br>About
Previous post<br>Next post<br>Back to top<br>Share post
Geospatial indexing, or Geocoding, is the process of indexing latitude-longitude pairs to small subdivisions of geographical space, and it is a technique that we data scientists often find ourselves using when faced with geospatial data.<br>Though the first popular geospatial indexing technique “Geohash” was invented as recently as 2008, indexing latitude-longitude pairs to manageable subdidivisions of space is hardly a new concept. Governments have been breaking up their land into states, provinces, counties, and postal codes for centuries for all sorts of applications, such as taking censuses and aggregating votes for elections.<br>Rather than using the manual techniques used by governments, we data scientists use modern computational techniques to execute such spatial subdividing, and we do so for our own purposes: analytics, feature-engineering, granular AB testing by geographic subdivision, indexing geospatial databases, and more.<br>Geospatial indexing is a thoroughly developed area of computer science, and geospatial indexing tools can bring a lot of power and richness to our models and analyses. What makes geospatial indexing techniques further exciting, is that a look under their proverbial hoods reveals eclectic amalgams of other mathematical tools, such as space-filling curves, map projections, tessellations, and more!<br>This post will explore three of today’s most popular geospatial indexing tools – where they come from, how they work, what makes them different from one another, and how you can get started using them. In chronological order, and from least to greatest complexity, we’ll look at:<br>Geohash<br>Google S2<br>Uber H3<br>It will conclude by comparing these tools, and recommending when you might want to use one over another.<br>Before getting started, note that these tools include much functionality beyond basic geospatial indexing: polygon intersection, polygon containment checks, line containment checks, generating cell-coverings of geographical spaces, retrieval of geospatially indexed cells’ neighbors, and more. This post, however, focuses strictly on geospatial indexing functionality.<br>Geohash<br>Geohash, invented in 2008 by Gustavo Niemeyer, is the earliest created geospatial indexing tool [1]. It enables its users to map latitude-longitude pairs to Geohash squares of arbitrary user-defined resolution. In Geohash, these squares are uniquely identified by a signature string, such as "drt".<br>The level-3 geohash in which I grew up! | Image by AuthorBut how are these strings generated?<br>The Geohash Algorithm<br>To map a latitude-longitude pair to a geohash is an elegantly simple algorithm:<br>1. Choose a `geohash-level`, or resolution. For our example, we'll choose `1`.<br>2. Create an empty binary array `S` of length `geohash-level * 5` (here, length equals `1` times `5`, so `5`).<br>3. For each geohash level, ask the question `5` times...<br>1. Is our point in the left half of the map? If so, append `0` to `S` and reset the map to be just the left half of the map; if our point is in the right half of the map, append `1` to `S` and reset the map to be just the right-half of the map.<br>2. Is our point in the bottom half of the map? If so, append `0` to `S` and reset the map to be just the bottom half of the map; if it's in the top half of the map, append `1` to `S` and reset the map to be just the top half of the map.<br>4. Convert every `5` bits from `S` into a Geohash 32-bit alphanumeric character, and return.<br>Computing a level-1 geohash. | Image by AuthorThis algorithm can be iteratively repeated arbitrarily many times, all the way down to geohashes that are less than a meter on each side!<br>Example iterative geohash subdivision. | Image by AuthorWhat’s particularly elegant about this algorithm is that, by following this pattern of “left is 0, right is 1; bottom is 0, top is 1”, the alphabetically ordered geohashes trace out a Z-order curve.<br>What’s a Z-order Curve?<br>The Z-order curve is a type of space-filling curve, which is designed just for this purpose of mapping multidimensional values (such as latitude-longitude pairs) to one dimensional representations (such as a string) [2].<br>Z-order curve of level 1 geohashes. | Image by AuthorGeohash is quite powerful: it’s simple, fast, and importantly, the geohash strings preserve spatial hierarchy (i.e. if your apartment is in the level 3 geohash "t1a", then it is also in the level 2 geohash "t1", and in the level 1 geohash "t"). However, you might have noticed a few issues with it by now…<br>Geohash’s Shortcomings<br>First, while the Z-order curve is convenient, it only weakly preserves latitude-longitude proximity in computed strings; particularly, due to edge effects of the Z-order curve, two locations that are close in physical distance are not guaranteed to be close in their...