Arithmetic Without Numbers
Exhibit 1
Integers as phase
interactive
The spiral is a simplified picture of a Fourier-style number code: one part of the vector tracks phase around a circle, while another tracks coarse position.
Sample reading: integer 137, phase 49.3°, cos 0.65, sin 0.76, coarse 13.
Control: integer value (999).
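The sample reading in the exhibit (integer 137, phase 49.3°, cos 0.65, sin 0.76, coarse 13) is consistent with a very simple encoding, sketched below. The cycle length of 1000 and the bucket width of 10 are assumptions inferred from that single reading and from the 999 shown by the control. They are not stated by the exhibit, and a real model's number code would be learned and far higher-dimensional.

```python
import math

PERIOD = 1000      # assumed cycle length (inferred, not stated in the exhibit)
COARSE_STEP = 10   # assumed width of one coarse bucket (inferred)

def encode(n):
    """Toy phase-plus-coarse code for a non-negative integer."""
    phase_deg = 360.0 * (n % PERIOD) / PERIOD   # position around the circle
    phase_rad = math.radians(phase_deg)
    return {
        "phase_deg": phase_deg,
        "cos": math.cos(phase_rad),   # the two coordinates that place
        "sin": math.sin(phase_rad),   # the integer on the unit circle
        "coarse": n // COARSE_STEP,   # rough position along the spiral
    }

reading = encode(137)
print({key: round(value, 2) for key, value in reading.items()})
```

Under these assumptions, nearby integers get nearby phases, while the coarse part distinguishes integers that are far apart.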
The question
A model has no fingers
If you learned arithmetic the ordinary human way, you probably learned it with a body. You counted on fingers. You grouped things into piles. You lined digits into columns. You carried a one. Later, perhaps, you used an abacus, graph paper, or a calculator.
A language model has none of that. It has matrices. Tokens enter, activations flow, logits come out. And yet, if you ask a modern language model for a greatest common divisor, a multiplication, or a division with remainder, something inside that matrix-only body responds.
Working vocabulary
Token: one unit the model reads or prints. A token might be a word, part of a word, punctuation, or a chunk of digits.
Vector: a list of numbers. A model stores each token's current state as a vector with many dimensions.
Activation: the model's temporary internal state while it is processing a token.
Readout: a small external model trained to recover a fact from an activation, such as the operation or an operand.
Logit: a raw score for a possible next token. A higher logit means the model is more likely to print that token.
Layer: one repeated processing step in the transformer. A modern model has many layers, each updating the running state.
Residual stream: the main running vector passed from layer to layer, like a shared scratchpad without named variables.
Attention: the part of a layer that lets one token position look at information from other positions.
MLP / feed-forward block: the part of each transformer layer that transforms one token position's vector by itself. Attention lets positions exchange information; the MLP then reshapes the local vector, often strengthening, suppressing, or recombining features already present there.
Next-token prediction: the training and generation rule for ordinary language models. Given the text so far, the model scores possible next tokens, prints one, then repeats the process.
Phase: position around a repeating cycle, like the angle of a hand on a clock. Helix-style number codes use phase-like geometry.
GCD / LCM: greatest common divisor and least common multiple. For example, gcd(84, 36) = 12.
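For reference, the gcd example in the vocabulary can be checked with Euclid's algorithm, the standard human-written procedure. This is a sketch of the ordinary algorithm, not a claim about how a language model arrives at the same answer.

```python
def gcd(a, b):
    """Greatest common divisor by Euclid's algorithm."""
    while b:
        a, b = b, a % b   # replace the pair with (b, remainder)
    return abs(a)

def lcm(a, b):
    """Least common multiple, derived from the gcd."""
    if a == 0 or b == 0:
        return 0
    return abs(a * b) // gcd(a, b)

print(gcd(84, 36))  # 12
print(lcm(84, 36))  # 252
```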
Rune began with the debugging question behind the jargon: when a language model gives an arithmetic answer, is it recalling a pattern, running something like an algorithm, or merely producing a plausible next token?
The human contrast
We learned arithmetic with bodies
George Lakoff and Rafael E. Núñez argued in Where Mathematics Comes From that human mathematical ideas are grounded in embodied experience: grouping, moving, measuring, balancing, collecting, and mapping one domain onto another. Whatever one thinks of the full philosophical claim, it is a useful starting point for this story.
A transformer has no fingers, no beads, no written columns, and no scratch paper. It has token embeddings, attention, feed-forward networks, residual streams, and matrices. If it learns arithmetic at all, it has to invent a machine-native version of number.
Humans also do arithmetic in more than one way. We answer 7 x 8 from a memorized multiplication table. We may divide 963 / 17 by running a written algorithm. We estimate tips with shortcuts. So the first scientific problem was not just "can the model answer?" It was "what kind of answering is this?" A memorized table, a learned shortcut, and a real multi-step calculation can all print the same number.
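The difference between recalling and calculating can be made concrete with a small sketch. The function names here are hypothetical, chosen for illustration: one looks up a stored product, the other runs schoolbook long division digit by digit. From the outside, each simply prints a number.

```python
# A "memorized" table: products are stored once, then only looked up.
TIMES_TABLE = {(a, b): a * b for a in range(1, 10) for b in range(1, 10)}

def recall_product(a, b):
    """Answer by lookup; fails for anything outside the table."""
    return TIMES_TABLE[(a, b)]

def long_division(dividend, divisor):
    """Schoolbook long division for a non-negative dividend
    and a positive divisor, processed one digit at a time."""
    quotient_digits = []
    remainder = 0
    for digit in str(dividend):
        remainder = remainder * 10 + int(digit)   # bring down the next digit
        quotient_digits.append(remainder // divisor)
        remainder %= divisor
    quotient = int("".join(str(d) for d in quotient_digits))
    return quotient, remainder

print(recall_product(7, 8))    # 56
print(long_division(963, 17))  # (56, 11)
```

Both calls involve a 56, but only the second performs intermediate steps that could be inspected. The output alone does not reveal which kind of process produced it.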
Residual stream
A vector changes as the model reads
Before we can ask whether a number is memorized, computed, or merely rendered, we need one more piece of machinery: the model's running state.
Imagine reading "What is the gcd of 84 and 36?" one token at a time. The model does not create a neat little variable named operand_a. Instead, each token position carries a long vector of numbers. As the prompt passes through the transformer layers, those vectors are updated again and again.
Some updates move information across positions: the token for 36 can affect the state near the answer position. Other updates reshape the local state: a direction in the vector may become more gcd-like, more operand-like, or more answer-like. The residual stream is the running scratchpad where those changes accumulate.
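The two kinds of update can be sketched in a few lines. This is a deliberately tiny toy, under loud assumptions: the "attention" is a plain average over the current and earlier positions, the "MLP" is a bare ReLU, and there are no learned weights, heads, or normalization. The only thing it is meant to show is the additive pattern, where each step adds its result to the running vector rather than overwriting it.

```python
def attention_update(states, position):
    """Toy attention: average the vectors at this and earlier positions."""
    visible = states[: position + 1]
    dim = len(states[0])
    return [sum(vec[i] for vec in visible) / len(visible) for i in range(dim)]

def mlp_update(vector):
    """Toy MLP: transform one position's vector by itself (ReLU)."""
    return [max(0.0, x) for x in vector]

def layer(states):
    """One layer: add the attention result, then add the MLP result."""
    after_attention = [
        [x + u for x, u in zip(vec, attention_update(states, pos))]
        for pos, vec in enumerate(states)
    ]
    return [
        [x + u for x, u in zip(vec, mlp_update(vec))]
        for vec in after_attention
    ]

states = [[1.0, -2.0], [3.0, 4.0]]   # two token positions, two dimensions
print(layer(states))                 # [[4.0, -4.0], [10.0, 10.0]]
```

In the toy run, the second position's vector changes because of the first position (the attention step), and then each vector is reshaped on its own (the MLP step).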
This is why readouts and patches are possible at all. If the operation and operands leave traces in the residual stream, a small readout may recover them. If a state really matters, a patch may change behavior. If a state is writable, an intervention may guide the model. But those are increasingly strong claims, and the vector itself does not label which claim is true.
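A toy version of the first two claims is sketched below. Everything in it is hypothetical: the two-number "state" stores the operands directly, the readout is a fixed projection standing in for a trained one, and the downstream "answer" is computed with Python's `math.gcd`. Real activations are not laid out this conveniently.

```python
import math

def toy_state(a, b):
    """Hypothetical activation: the operands stored in two slots."""
    return [float(a), float(b)]

def toy_answer(state):
    """Hypothetical downstream computation that consumes the state."""
    return math.gcd(int(state[0]), int(state[1]))

def read_out(state, direction):
    """Linear readout: project the state onto a direction."""
    return sum(s * d for s, d in zip(state, direction))

def patch(target, source, index):
    """Copy one coordinate from a source run into a target run."""
    patched = list(target)
    patched[index] = source[index]
    return patched

clean = toy_state(84, 36)       # run on "gcd of 84 and 36"
corrupted = toy_state(84, 35)   # run on a different second operand

print(read_out(clean, [0.0, 1.0]))              # 36.0: the operand is decodable
print(toy_answer(corrupted))                    # 7
print(toy_answer(patch(corrupted, clean, 1)))   # 12: the patch changes behavior
```

The readout shows only that the operand can be recovered. The patch is the stronger test, because it shows that the downstream answer actually depends on that part of the state.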
tiny residual vector
layer update: scratch vector + attention +...