The shared recipe behind search: Images, Shazam and RAG

Pablo Carneiro Elias | May, 2026 | 15 min read

Or: why every modern “find me something similar” feature is the same problem in disguise, and what makes it hard at scale.

RAG stands for Retrieval-Augmented Generation.

Open Google Images, drop in a photo, and a few hundred milliseconds later you get every page that hosts a near-duplicate of it. Hold your phone up to a song and Shazam tells you what it is. Type a question into ChatGPT and, before answering, it pulls the three most relevant documents out of a database that might have a hundred million entries. Spotify suggests songs that feel like the one you’re playing. Tinder shows you faces that look like the ones you swipe right on.

These all look like wildly different features built by different teams. They’re not. Under the hood it’s the same four-step trick every time:

1. Turn the messy real-world thing (a photo, a song, a sentence, a person) into a vector of numbers.
2. Drop that vector into a shared space along with millions of others built the same way.
3. When a query arrives, turn it into a vector too.
4. Find the nearest existing vectors to the query. Those are your answer.

The first step is called making a descriptor. The fourth step is nearest-neighbour search, often shortened to KNN (k nearest neighbours).

This post is a tour. We’ll start completely from scratch: what a descriptor even is, with examples from images, audio, and text. Then we’ll watch all of those collapse into the same geometric problem, look at why that problem is genuinely hard at high dimensions (the curse of dimensionality), and end with a specific trick that works really well for binary descriptors: Multi-Index Hashing (MIH). Along the way I’ll show you some benchmark numbers from pynear, the small library we built for exactly this kind of work.

This is aimed at curious developers; no machine-learning background is required. The math sections are marked, and you can skip them.

1. Descriptors: the steady bridge from messy data to geometry

A descriptor is a small summary of a thing: small enough that you can store millions of them in memory, structured enough that similar things produce similar descriptors.

That second property is the whole game. If similarity in your problem maps to closeness in the descriptor space (each descriptor being a “vector”, or point, in that space), then finding similar items reduces to finding close vectors, which is a geometry problem. Geometry has decades of fast-algorithm development behind it. Domain semantics (like “are these two cats the same breed?” or “are these two paragraphs talking about the same thing?”) do not.

So: how do you build a descriptor that has this property? It depends on the domain.

Images: SIFT, ORB, and friends

A photograph is a few million pixels. You can’t compare two photographs pixel-by-pixel: shift the camera one pixel to the left and every comparison fails. What you can compare is the local structure at the interesting points: corners, blobs, places where the image has distinctive texture.

SIFT (Scale-Invariant Feature Transform, 1999) and ORB (2011) are recipes for turning each interesting point into a fixed-size descriptor, typically 128 floats for SIFT and 256 bits for ORB. The recipe, in spirit:

1. Find a few hundred “interesting” points in the image (corners, blobs).
2. Around each one, look at a small neighbourhood (say 16×16 pixels).
3. Compute the dominant gradient direction in that patch, and rotate the patch so that direction points up. This is the rotation-invariance trick: no matter how you tilt the camera, the descriptor for the same physical corner comes out the same.
4. Split the rotated patch into sub-cells, build a histogram of gradient directions in each cell, and concatenate them all. That’s your descriptor vector.

The clever part is step 3. A corner of a building looks like a corner whether you photograph the building upright, sideways, or upside-down.
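To make step 3 concrete, here is a minimal sketch in Python with NumPy. It is a simplification under stated assumptions, not SIFT itself: instead of resampling a rotated patch it measures every gradient angle relative to the dominant one, and it builds a single histogram for the whole patch rather than one per sub-cell. The function name `orientation_histogram` and its parameters are made up for this illustration.

```python
import numpy as np

def orientation_histogram(patch, n_bins=8, align=True):
    """Magnitude-weighted histogram of gradient directions for one patch.

    Simplified sketch: one histogram for the whole patch (no sub-cells),
    and alignment subtracts the dominant angle instead of rotating pixels.
    """
    gy, gx = np.gradient(patch.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), 2 * np.pi)  # direction in [0, 2*pi)

    if align:
        # Coarse 36-bin histogram (10 degrees per bin) finds the dominant direction.
        coarse_width = 2 * np.pi / 36
        coarse_bin = np.floor(angle / coarse_width).astype(int) % 36
        coarse = np.bincount(coarse_bin.ravel(),
                             weights=magnitude.ravel(), minlength=36)
        dominant = (np.argmax(coarse) + 0.5) * coarse_width
        # Express every direction relative to the dominant one.
        angle = np.mod(angle - dominant, 2 * np.pi)

    width = 2 * np.pi / n_bins
    bins = np.floor(angle / width).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-12)  # unit length for comparison

rng = np.random.default_rng(0)
patch = rng.random((16, 16))
rotated = np.rot90(patch)  # the same patch seen with the camera turned 90 degrees

aligned_same = np.allclose(orientation_histogram(patch),
                           orientation_histogram(rotated), atol=1e-6)
raw_same = np.allclose(orientation_histogram(patch, align=False),
                       orientation_histogram(rotated, align=False), atol=1e-6)
print(aligned_same, raw_same)
```

With alignment switched on, the two histograms should agree even though the pixels have moved; with it switched off, the histogram of the rotated patch is a shifted copy of the original and generally no longer matches.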
Without orientation alignment, those three photos would produce three totally different descriptors and matching would fail. With it, all three give you (approximately) the same vector.

ORB does the same thing but cheaper: instead of gradient histograms, it computes 256 binary tests (“is this pixel brighter than that pixel?”), packs the results into a 256-bit string, and ships it. Each ORB descriptor is therefore 32 bytes. A 1 GB index can hold ~30 million of them.

The same recipe works on faces (FaceNet, ArcFace), fingerprints (minutiae descriptors), satellite-image patches, MRI volumes, anything spatial.

Audio: Shazam, Chromaprint, and the spectrogram

Sound is a one-dimensional waveform that nobody can usefully compare in the time domain. Convert it to a spectrogram (frequency over time) and suddenly there’s structure: a song...
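Stepping back to the ORB-style binary descriptors from the image section: the sketch below shows, in Python with NumPy, how 256 brightness tests pack into 32 bytes and how a brute-force nearest-neighbour search under Hamming distance works on them. It is an illustration only. Real ORB uses a specific, orientation-steered sampling pattern, whereas the random pixel pairs, the random patches, and the helper names `binary_descriptor` and `hamming_knn` here are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_descriptor(patch, pairs):
    """ORB-style sketch: one bit per 'is pixel a brighter than pixel b?' test."""
    flat = patch.ravel()
    bits = flat[pairs[:, 0]] > flat[pairs[:, 1]]  # 256 booleans
    return np.packbits(bits)                      # 256 bits -> 32 uint8 bytes

def hamming_knn(query, database, k=3):
    """Brute-force k nearest neighbours under Hamming distance."""
    xor = np.bitwise_xor(database, query)          # set bits mark disagreements
    dist = np.unpackbits(xor, axis=1).sum(axis=1)  # count differing bits per row
    order = np.argsort(dist, kind="stable")[:k]
    return order, dist[order]

# Hypothetical setup: 256 fixed random pixel pairs inside a 16x16 patch.
pairs = rng.integers(0, 16 * 16, size=(256, 2))
patches = rng.random((1000, 16, 16))
database = np.stack([binary_descriptor(p, pairs) for p in patches])

# Query: patch 42 again, with a little sensor-like noise added.
noisy = patches[42] + rng.normal(0, 0.01, size=(16, 16))
idx, dist = hamming_knn(binary_descriptor(noisy, pairs), database)

print(database.shape)    # one 32-byte row per descriptor
print(idx[0], dist[0])   # best match and its Hamming distance
print(10**9 // 32)       # descriptors that fit in 10^9 bytes
```

The last line is the arithmetic behind the index-size figure: 10^9 bytes divided by 32 bytes per descriptor is 31.25 million, in line with the ~30 million quoted above. The search itself scans every row, which is exactly the cost that indexing schemes for binary descriptors try to avoid.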
