A Transformer Is All You Need | Zenodo
Published June 26, 2026 | Version v1
Preprint
Open
A Transformer Is All You Need
Authors/Creators
Lamoureux, Marc
Description
The unanswered question in mechanistic interpretability of pretrained transformers is plain: for any prompt and any decoder-only transformer, which weights at which layers along which residual-stream dimensions produced the decision the model emitted? Activation probing reports a per-depth accuracy curve. Sparse dictionaries decompose activations into monosemantic features. Logit and tuned lenses trace the trajectory of a prediction through the residual stream. None of these names the weight that did the work. The weights are the artifact training produced, the substrate every activation must traverse, the only object in the system that persists across forward passes; interpretability that treats them as a fixed backdrop describes what the model is doing right now, never why this particular model with these particular weights had to do it.
We close that gap with one primitive — the alignment of a residual-stream activation with the top singular directions of a weight matrix, scaled by the singular values — and a small cross-layer transformer (the hybrid weight–activation probe) that consumes the joint (activation, alignment) sequence and predicts the host model's next-token decision. As a byproduct of training, the probe exposes per-layer importance (the depth at which the host's decision crystallized) and per-layer alignment importance over the three weight families Q/K/V, attention output, and MLP up/gate (which family at each layer carried the decisional signal, and via the SVD along which singular directions). A separate gradient-attribution pass through the host model closes the causal loop, confirming the weights the probe identifies are the same weights whose perturbation moves the host's logit on that decision. The pipeline answers, for any prompt on any frozen pretrained decoder-only transformer, the question every prior interpretability tool has had to leave open: which weight, at which layer, along which dimensions, produced this token.
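The alignment primitive described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the preprint's implementation. It assumes the weight matrix has shape (d_out, d_in) and acts on the activation as W @ h, so the right singular vectors live in the residual-stream space. The function name svd_alignment, the parameter k, and the random matrices are hypothetical.

```python
import numpy as np

def svd_alignment(weight, activation, k=4):
    """Project an activation onto the top-k right singular directions of
    `weight`, scaling each projection by the matching singular value."""
    # np.linalg.svd returns singular values in descending order, so the
    # first k rows of vt are the top-k right singular directions.
    _, s, vt = np.linalg.svd(weight, full_matrices=False)
    k = min(k, s.shape[0])
    # Singular vectors are only defined up to sign, so the magnitude of
    # each entry is the meaningful quantity in this sketch.
    return s[:k] * (vt[:k] @ activation)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))  # hypothetical weight, shape (d_out, d_in)
h = rng.standard_normal(8)        # hypothetical residual-stream activation

align = svd_alignment(W, h, k=4)  # top-4 alignment scores
full = svd_alignment(W, h, k=8)   # all 8 singular directions
```

With every singular direction kept, the alignment vector has the same Euclidean norm as W @ h, because it is W @ h expressed in the basis of left singular vectors. Truncating to the top k keeps the directions along which the weight amplifies its input most strongly.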
We demonstrate the pipeline on four structurally distinct decoder-only transformers spanning five years of architectural and training evolution: GPT-2 medium (2019, 355M, WebText), Pythia 2.8B (2023, 2.8B, the Pile), Mistral 7B v0.1 (late 2023, 7.3B, SwiGLU/RMSNorm/GQA/sliding-window), and LLaMA 3 8B base (2024, 8B, SwiGLU/RMSNorm/GQA, 128K-token tiktoken vocabulary, 15T training tokens). On all four the probe converges well above the 0.001 random baseline (approximately 1/1024) over a compact 1024-token target vocabulary and produces a coherent per-prompt attribution report; absolute accuracy serves only as a chance-baseline sanity check, and the attribution result is invariant under any above-chance probe accuracy. As an unplanned byproduct of running the same pipeline on this panel, the per-weight-family attribution proportions on all four hosts lie within ℓ₁ distance 0.019 of the uniform point [1/3, 1/3, 1/3], the centroid of the 2-simplex, with a maximum pairwise ℓ₁ separation of 0.034. We did not engineer this observation and did not select hosts to produce it; we report it as a downstream finding, not as the central claim.
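The two simplex statistics quoted above (ℓ₁ distance to the uniform point and maximum pairwise ℓ₁ separation) and the chance baseline are simple to compute. The sketch below shows the arithmetic only. The host names and proportion vectors are made-up placeholders, not values from the preprint.

```python
from itertools import combinations

def l1(p, q):
    """L1 distance between two equal-length proportion vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Uniform point of the 2-simplex over the three weight families
# (Q/K/V, attention output, MLP up/gate).
uniform = [1 / 3, 1 / 3, 1 / 3]

# Made-up illustrative proportions, NOT results from the preprint.
hosts = {
    "host_a": [0.34, 0.33, 0.33],
    "host_b": [0.33, 0.34, 0.33],
}

# Distance of each host's attribution proportions from the uniform point.
dist_to_uniform = {name: l1(p, uniform) for name, p in hosts.items()}

# Largest L1 separation between any two hosts.
max_pairwise = max(l1(hosts[a], hosts[b]) for a, b in combinations(hosts, 2))

# Chance accuracy for a uniform guess over a 1024-token target vocabulary.
chance = 1 / 1024
```

By the triangle inequality, the pairwise separation between two hosts can never exceed the sum of their distances to the uniform point, which is consistent with the reported 0.034 sitting below 2 × 0.019.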
From the single primitive of weight-level causal decision attribution follow nine capability families: per-prompt visibility into the decision pathway at every layer; causal diagnostics with no behavioral inference; weight-level surgical intervention on specific model behaviors with no retraining, fine-tuning, or RLHF; capability operations (localization, extraction, transplantation, removal); security and forensics including backdoor, sleeper-agent, distillation-source, and post-training tampering detection; safety-specific detection of deceptive alignment, sandbagging, hidden goals, evaluation awareness, sycophancy, pressure deception, reward hacking circuits, and specification gaming at the structural substrate; training economics through capability-preserving compression and targeted fine-tuning; cross-lab audit capability over any transformer family with no method rebuild; and comparative analysis across architectures, training methods, checkpoints, fine-tunes, and merges. The instrument is the result. The reproducibility observation is one of its dividends, not its claim.
Files
TransformerIsAllYouNeed.pdf
md5:dd8ece9df8359368c6cb16a3b492b299
148.6 kB
Indexed in
OpenAIRE
Keywords and subjects
Keywords
mechanistic interpretability
transformer...