AI Interpretability Is a Revolutionary Skill

micahwhite · 1 point · 1 comment

The Dark Between the Stars | Outcry Research

Contents: Opening · Geometry · The soft prompt · The demo · Sovereignty · Risks · Closing · Note on method

Early in life I discovered something about myself: certain ideas give me physical sensations. Reading Sophie’s World as a preteen, I found that particular passages (Zhuangzi’s butterfly dream, especially) produced a delightful tingling in the brain, something close to ASMR but cued by concepts rather than sounds. I have followed those sensations ever since. They are most of the reason I studied philosophy. They are most of the reason I have pursued the special interests I have pursued. Over time I learned that the unpleasant variants, such as the claustrophobia that comes from the photo of Berry Cannon in the underwater SEALAB II or the terrifying vastness that comes from the thought of Voyager 1 hurtling farther and farther from Earth, were just as worth following as the pleasant ones, and arguably more so, because they tend to serve as guideposts to unexplored, and inarticulable, areas of my mind.

For the last several months I have been following one of these signals into a place I did not expect to end up: the non-linguistic interior of an artificial intelligence language model. The sensation is strong and unusual (distinct from others I routinely experience) and I cannot fully name it yet. What I can tell you is that it gets stronger as I come to understand the region of the AI's interior that has no words in it, a region the model's thought nevertheless passes through every time it writes. The closer I get to visualizing that region in order to provoke the sensation, the more I suspect the work is not really about AI at all. It is about what it means for a mind, any mind, to know and learn to express something it cannot say. This essay is concrete about the AI part.
The deeper claim, the one the sensation keeps insisting on, is suggestive, but I admit I have no evidence for it (yet).

A modern language model is, among other things, a dictionary. Not the kind on a shelf: the kind that has been pressure-cooked out of a trillion words of internet text and left as residue inside a few hundred billion numerical weights. Somewhere in that residue are the concepts the model has learned to think with. Bridge. Refusal. Sentiment. Advertising. In May 2024 Anthropic made this vivid with Golden Gate Claude, a version of their assistant in which the internal concept for the Golden Gate Bridge had been turned up so high the model could barely talk about anything else. The point of the demo was that the dictionary is real, inspectable, and, crucially, editable.

The point I want to make here is that the dictionary is also small, and the words most vital to you, and by extension to all of us, may not be in it.

Before going further, I need to be specific, and the specificity matters, because the model class I am about to describe is not the one you talk to through ChatGPT or Claude. In this essay I am talking only about the open-source AI models that let activists build local, private AI. Adam Karvonen recently published an interpretability dictionary for Qwen3-8B, an open-source model in the same weight class as the ones a movement can actually run on its own hardware: downloaded once, run on a laptop, no API key, no per-token fee, no continuous internet connection, totally private. The dictionary maps 64,947 concepts that the model has ready to grasp internally, each one a direction in the model’s internal activation space, each one labeled automatically by Gemini. That sounds like a lot until you go looking for something particular.
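
To make "each one a direction in the model's internal activation space" concrete, here is a toy sketch in plain Python. It assumes a concept can be treated as a unit vector, reads how strongly an activation expresses that concept with a dot product, and "turns it up" by adding a scaled copy of the direction, in the spirit of the Golden Gate demo. The three-dimensional vectors and the names `concept_strength` and `steer` are invented for illustration; nothing here is taken from Qwen3-8B or from Karvonen's dictionary.

```python
import math

def normalize(v):
    """Scale a vector to unit length so it acts as a pure direction."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def concept_strength(activation, direction):
    """Dot product: how far the activation points along the concept direction."""
    return sum(a * d for a, d in zip(activation, direction))

def steer(activation, direction, amount):
    """Add `amount` units of the concept direction to the activation."""
    return [a + amount * d for a, d in zip(activation, direction)]

# Toy 3-dimensional stand-ins (assumption); real activations have thousands of dimensions.
bridge = normalize([3.0, 4.0, 0.0])   # hypothetical "bridge" direction
activation = [1.0, 0.0, 2.0]          # hypothetical internal state

before = concept_strength(activation, bridge)
after = concept_strength(steer(activation, bridge, 10.0), bridge)
print(round(before, 3), round(after, 3))
```

Because the direction has unit length, steering by `amount` raises the readout by that same amount. A concept with no entry in the dictionary offers no labeled direction to read or adjust in this way.
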
I went looking for the vocabulary of four activist traditions I care about: the Adbusters lineage I came out of; Guy Debord’s Situationists, who inspired Adbusters; John Zerzan’s green anarchism, which pushes the limits of radical critique; and the Black Lives Matter / Afrofuturist tradition, which is integral to any struggle. Twenty-five concepts in total, the kind of words that appear on the spines of canonical books and in the citations of working organizers.

Zero came back as clearly present. Twenty-two were absent entirely. Kimberlé Crenshaw’s intersectionality, the most-cited concept in critical race theory of the last three decades: absent. Angela Davis’s prison abolition, the spine of the contemporary BLM platform: absent. Debord’s society of the spectacle, the central concept of an entire post-1968 tradition: absent in any meaningful sense. Even civil disobedience and nonviolence, mainstream high-school-curriculum concepts, were barely in the AI's dictionary of concepts. The model has plenty of room for protest, revolution, and voting (those landed cleanly), but the actual working vocabulary of the last sixty years of social movements is, for practical purposes, not there.

Before the obvious objection lands, I...
