Playing with Vision Embeddings | Preston Jensen<br>[Figure: images captioned "Corn kernels", "Triumphal Arch", and "Kernel Arch"]
Embeddings are, in a sense, the native language of neural networks. They are how networks can encode a rich variety of semantically meaningful representations with just a list of numbers. However, those numbers are frustratingly opaque. You certainly won't be able to make sense of them by reading them one after another. In this post, we try to make sense of one neural network's embeddings.<br>The Model
The model we're going to be looking at in this post is DINOv3 ViT-S (Siméoni et al., 2025). DINOv3 is interesting because it learns to map raw pixels to a rich feature space with very few priors. It doesn't know language, it can't describe what it sees, but it still learns to make sense of images. We won't go into full detail about how DINOv3 was trained, but two things matter for this post: it compresses any image into a single embedding (a list of 384 numbers) and it was trained so that different crops and augmentations of an image will have similar embeddings. Our goal is to understand what information is encoded in those 384 numbers.<br>Generating Images from Embeddings
To start playing around in this 384-dimensional space, we need some way to translate these numbers back into something that humans can understand. The most natural place to do this is in the one language humans and DINOv3 both understand: images. More concretely, we want to be able to take a point in this 384-dimensional space and generate an image that DINOv3 would map to that point.<br>To do this, we leverage two ideas. The first is that DINOv3 is fully differentiable: when you feed an image into the model, you can compute how to tweak the pixels so that the output vector moves closer to some target. So we can maximize cosine similarity between the generated image's embedding and some target embedding. People have generated images this way for a while; see, for example, DeepDream (Mordvintsev et al., 2015) and Olah et al.'s feature visualization work (Olah et al., 2017).<br>The second is that DINOv3 was trained such that different crops and augmentations of an image land close together in embedding space. We can mimic that same cropping and augmentation strategy when we build up the gradient for the pixels. This helps in two ways: first, it stops the optimizer from cheating with high-frequency noise (Olah et al., 2017); second, it optimizes for the model's own definition of sameness.<br>There are a couple of other tricks we use to make the images look nicer: we produce the image with an untrained transformer backbone (similar in spirit to Deep Image Prior, Ulyanov et al., 2017), and we minimize an auxiliary total variation loss.<br>Once this pipeline is set up, given an arbitrary direction in 384-dimensional space, we can generate an image that DINOv3 says would point in that direction. For example, below we take a photo of an alpine landscape, compute its DINOv3 embedding, and then use our generation technique to produce an image that points in that same direction.
[Figure: four panels: Original; Raw pixels, no augmentations; Raw pixels with augmentations; Transformer with augmentations]
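To make the optimization loop described above concrete, here is a minimal sketch of the idea, not the pipeline used in this post. Everything in it is an assumption made for illustration: the encoder is a hypothetical random linear map standing in for DINOv3, the "image" is a 64-value 1D signal, the augmentations are circular shifts, and the total variation penalty is a smooth squared variant. It also skips the untrained-transformer parameterization and updates raw pixels directly.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PIXELS, EMBED_DIM = 64, 8   # toy sizes; the post works with real images and 384 dims

# Hypothetical stand-in encoder: a fixed random linear map, NOT DINOv3.
W = rng.normal(size=(EMBED_DIM, N_PIXELS))
SHIFTS = [-2, -1, 0, 1, 2]    # toy "augmentations": circular shifts of the signal


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def cosine_grad(x, target):
    """Analytic gradient of cosine(W @ x, target) with respect to x."""
    z = W @ x
    nz, nt = np.linalg.norm(z), np.linalg.norm(target)
    dz = target / (nz * nt) - (z @ target) * z / (nz**3 * nt)
    return W.T @ dz


def tv_grad(x):
    """Gradient of the smooth TV penalty sum((x[i+1] - x[i]) ** 2)."""
    d = np.diff(x)
    g = np.zeros_like(x)
    g[:-1] -= 2 * d
    g[1:] += 2 * d
    return g


def mean_similarity(x, target):
    """Average cosine similarity to the target over every toy augmentation."""
    return float(np.mean([cosine(W @ np.roll(x, s), target) for s in SHIFTS]))


def generate(target, x0, steps=300, lr=1.0, tv_weight=1e-3, n_aug=4):
    x = x0.copy()
    for _ in range(steps):
        grad = np.zeros_like(x)
        for s in rng.choice(SHIFTS, size=n_aug):
            # Backpropagate through the augmentation: shift forward, un-shift the gradient.
            grad += np.roll(cosine_grad(np.roll(x, s), target), -s) / n_aug
        # Ascend the similarity term, descend the smoothness penalty.
        x = x + lr * (grad - tv_weight * tv_grad(x))
    return x


reference = np.sin(np.linspace(0, 4 * np.pi, N_PIXELS))  # toy "photo"
target = W @ reference                                   # its embedding
start = rng.normal(size=N_PIXELS)                        # start from noise
result = generate(target, start)
print(f"noise: {mean_similarity(start, target):.3f}  "
      f"optimized: {mean_similarity(result, target):.3f}")
```

With a real vision model you would let an autodiff framework compute the gradients instead of deriving them by hand; the structure of the loop (sample augmentations, average the similarity gradient, add a smoothness penalty) stays the same.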
Now, the generated image seems to capture the general vibe of the original image. It clearly shows mountains, snow, and a lake. To get a fuller sense of what the pipeline produces, take a look at a spread of generated images:
You'll see some variation from generation to generation (after all, we are compressing an entire image down to 384 numbers, which is inherently a many-to-one operation). But you'll also notice a few common ways in which they differ from the original. They're more saturated, higher contrast, and they misplace or duplicate some of the objects in the scene. Much of this is likely due to the image generation pipeline. Try to keep these telltale signatures of the generated images in mind so you can mentally invert them as we proceed.<br>Finding the Features<br>The first thing we need to understand before we start trying to pick apart the 384-dimensional space is that DINOv3 encodes far more than 384 distinct visual concepts into those 384 numbers. How? The leading hypothesis is something called superposition: models learn to cram in many more features than their embeddings have dimensions by pointing each feature in a nearly-orthogonal direction (Elhage et al., 2022).<br>To demonstrate this phenomenon, we'll show how a small toy neural network can squeeze 10 MNIST digit classes through a 2-dimensional bottleneck. Every frame here corresponds to one optimizer step, so you can watch the model learn to represent all 10 classes.<br>10 digit classes squeezed into 2 dimensions. Each class gets its own slanted direction.
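As a quick sanity check on the "nearly-orthogonal directions" idea, the sketch below samples random unit vectors and reports the largest absolute cosine similarity between any pair. It assumes randomly drawn directions, which is only a rough proxy for how a trained network arranges its features; the function name and the choice of 2000 vectors are illustrative, not taken from DINOv3.

```python
import numpy as np

rng = np.random.default_rng(0)


def max_abs_cosine(n_features, dim):
    """Largest |cosine| between any two of n_features random unit vectors in dim dimensions."""
    v = rng.normal(size=(n_features, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # normalize each row to unit length
    cos = v @ v.T                                  # all pairwise cosine similarities
    np.fill_diagonal(cos, 0.0)                     # ignore each vector's match with itself
    return float(np.abs(cos).max())


for dim in (2, 384):
    print(f"dim={dim}: worst-case |cosine| among 2000 directions = "
          f"{max_abs_cosine(2000, dim):.3f}")
```

In 2 dimensions, 2000 directions are forced to crowd together, so some pair is almost parallel. In 384 dimensions, the cosine between two random directions has a standard deviation of roughly 1/sqrt(384), about 0.05, so even the worst pair among thousands stays far from parallel. That slack is what leaves room for more features than dimensions.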
The key observation is that neural networks tend to place features along directions in their hidden space. Each of the 10 digit clusters above appears to point in its own distinct direction out from the origin (directions annotated in the image below). In 2 dimensions there's...