Can SAEs Capture Neural Geometry?

The Neural Geometry Series

AKA, how to use straight lines to capture curved geometry in neural networks

"Blind Men Appraising an Elephant", Ohara Donshu, Edo period. Like the proverbial blind men, individual SAE features each capture an incomplete view of a curved manifold in activation space.

Authors

Usha Bhalla (*, 1, 2)

Atticus Geiger (†, 1)

Thomas Fel (*, 1)

Ekdeep Singh Lubana (†, 1)

Can Rager (1)

Sheridan Feucht (1, 3)

Tal Haklay (1, 4)

Daniel Wurgaft (1, 5)

Siddarth Boppana (1)

Matthew Kowal (1)

Vasudev Shyam (1)

Owen Lewis (1)

Thomas McGrath (1)

Jack Merullo (1)

* Equal contribution

† Equal senior contribution

1 Goodfire

2 Harvard University

3 Northeastern University

4 Technion IIT

5 Stanford University

Published

May 21, 2026

Full Paper

Read on arXiv →

Curved geometry in neural representations is abundant and essential for understanding and controlling neural networks – but straight lines are much easier to work with.

Can we use lines to capture curved neural geometry?

Yes—but it's complicated. We are reminded of the old proverb in which blind men encounter an elephant for the first time. Each touches a different part—the trunk, the tusk, the leg—and each comes to a different conclusion about the elephant: the man touching the trunk says an elephant is like a snake, the one touching the leg says it's like a tree, and so on. Similarly, a single line can only give us a partial view of curved geometric structure, and consequently an incomplete understanding of the bigger picture. As with the elephant, the full meaning emerges only when the parts are understood as a whole.

In this post, we examine how the directions learned by sparse autoencoders (SAEs) relate to neural geometry, uncovering three different ways that lines can represent curved manifolds. Then, we leverage our understanding to develop an unsupervised pipeline for uncovering geometric structure in neural representations.

If we can automatically surface neural geometry, we can begin to understand neural networks at scale, on their own terms – by using the same internal geometry that they do. That in turn will enable deeper understanding and finer-grained, more robust control of neural networks.

Sparse autoencoders: missing the bigger picture

Sparse autoencoders (SAEs) [1] are a popular method in interpretability for decomposing neural representations using many different directions in activation space. These directions can be used to map out the inner world of a neural network; activations are expressed as a linear combination (weighted sum) of directions.

[1] Cunningham et al. 2023; Bricken et al. 2023; Gao et al. 2024; Lieberum et al. 2024; Costa et al. 2025; Hindupur et al. 2025; Fel et al. 2025. See the following for a non-technical introduction.
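To make the "weighted sum of directions" picture concrete, here is a minimal sketch of an SAE forward pass in NumPy. The ReLU encoder, the unit-norm decoder rows, the sizes, and all names (`sae_encode`, `sae_decode`, `W_enc`, `W_dec`) are illustrative assumptions, and the weights are random rather than trained; this is not the architecture used in the paper.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Map activations to non-negative feature magnitudes (assumed ReLU encoder)."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(z, W_dec, b_dec):
    """Reconstruct activations as a weighted sum of decoder directions (rows of W_dec)."""
    return z @ W_dec + b_dec

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # hypothetical sizes, chosen only for illustration

# Random, untrained weights: enough to show the shapes and the arithmetic.
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # one unit-norm direction per feature
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))    # a batch of 4 stand-in activation vectors
z = sae_encode(x, W_enc, b_enc)      # feature magnitudes, shape (4, 32)
x_hat = sae_decode(z, W_dec, b_dec)  # reconstructions, shape (4, 8)

# The same reconstruction written out as an explicit weighted sum of directions.
manual = b_dec + sum(z[0, i] * W_dec[i] for i in range(n_features))
```

In a trained SAE, a sparsity penalty or top-k constraint would leave most entries of `z` at zero, so each activation is explained by only a handful of directions.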

Interpretability researchers initially hoped that each direction, or feature, would be a single concept, and that the magnitude in that direction would correspond to something like intensity or confidence. While we now know that straight lines are not the universal "atoms" of neural cognition (see demos and citations from our main post), SAE features can still give us a window into more complex geometric structure.

Like the proverbial blind men and the elephant, no individual SAE feature can "see" the entirety of a curved manifold. But taken together, the observations of a group of SAE features let us reconstruct such a manifold.
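A toy illustration of this idea, under assumptions of our own choosing: take the unit circle as the manifold and hand-place six rectified linear features around it. The feature count, the threshold, and the least-squares readout are all hypothetical choices for this sketch, not results from a trained SAE.

```python
import numpy as np

# The "manifold": 360 evenly spaced points on the unit circle.
theta = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
points = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # shape (360, 2)

# Six hand-placed straight-line directions, evenly spaced around the circle.
n_features = 6  # hypothetical
centers = np.linspace(0.0, 2.0 * np.pi, n_features, endpoint=False)
directions = np.stack([np.cos(centers), np.sin(centers)], axis=1)  # shape (6, 2)
threshold = 0.5  # hypothetical bias: a feature fires only near its own direction

# Rectified projections: each feature is active on a single arc of the circle.
acts = np.maximum(0.0, points @ directions.T - threshold)  # shape (360, 6)

coverage_per_feature = (acts > 0).mean(axis=0)  # fraction of the circle each feature sees
covered_by_any = (acts > 0).any(axis=1).mean()  # fraction seen by at least one feature

# Taken together: a linear readout from all six features back to the circle.
decoder, *_ = np.linalg.lstsq(acts, points, rcond=None)
recon = acts @ decoder
rel_err = np.linalg.norm(recon - points) / np.linalg.norm(points)
```

Each feature in this setup fires on roughly a third of the circle, so none of them sees the whole structure, yet every point is seen by at least one feature and the joint readout recovers the circle approximately.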

Three ways SAE features can represent manifolds

We trained a sparse autoencoder on synthetic data that contained a mixture of geometric structures, including donuts, spheres, Möbius strips, and more. We found that as we change the number of directions used to reconstruct a representation, SAEs can recover geometric structures in three distinct ways:

Shattering. Every point on a manifold is separately represented with a unique feature. These features "tile" the curved structure by each pointing to a single spot on it.

Compact capture. All the points on a manifold are jointly represented with a small set of shared features. The SAE features act as a coordinate system for the manifold, though not the most natural one (since they are straight lines, and the structure is a curved surface).

Dilution. Points on a manifold are represented by a moderate number of features that are partially shared between points. This is conceptually in between shattering and compact capture.
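One way to tell these three regimes apart is to look at which features are active on points sampled from a single manifold. The sketch below uses hand-built toy code matrices and a diagnostic of our own (`support_stats`, a hypothetical helper); it is an illustration of the definitions above, not the measurement used in the paper.

```python
import numpy as np

def support_stats(codes, eps=1e-8):
    """Summarize feature usage over points sampled from one manifold.

    codes: array of shape (n_points, n_features) of SAE feature magnitudes.
    Returns (features used anywhere, mean active features per point,
             mean fraction of points on which a used feature is active).
    """
    active = codes > eps
    used = active.any(axis=0)
    n_used = int(used.sum())
    mean_l0 = float(active.sum(axis=1).mean())
    sharing = float(active[:, used].mean(axis=0).mean()) if n_used else 0.0
    return n_used, mean_l0, sharing

n_points = 12

# Shattering: every point gets its own private feature.
shattering = np.eye(n_points)

# Compact capture: three features, all active on every point.
compact = np.ones((n_points, 3))

# Dilution: six features, each active on a window of four neighboring points.
dilution = np.zeros((n_points, 6))
for j in range(6):
    window = (2 * j + np.arange(4)) % n_points
    dilution[window, j] = 1.0

stats = {
    "shattering": support_stats(shattering),
    "compact": support_stats(compact),
    "dilution": support_stats(dilution),
}
```

In these toy cases the sharing score is lowest for shattering, highest for compact capture, and intermediate for dilution, which matches the ordering described in the list above.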

When we train SAEs on actual neural network representations, we observe dilution. Directions correspond to regions on the geometric structure with different sizes and locations. Consider the examples below, of real SAE features and manifolds:

Each of the individual SAE features covers a different part of the manifold. For example, consider the temperature manifold. The feature labeled "Cold weather and its effects" fires at one end, while the feature labeled "Extreme heat and its effects on activities and environments" fires at the...
