AI is Computation | Gabriel Pickard
You are using an outdated browser. Please upgrade your browser or install Google Chrome to experience this site.
Our first encouter with AI in the form of LLMs, within the product of ChatGPT was very much like an oracle. You ask a question, the oracle speaks, and that is the transaction.<br>This story of course has been less and less applicable as we march on to the glorious future of agentic AI which, somehow, does things -- maybe even computes.
The original framing for machine learning was function approximation. This framing was of course accurate, but it does fail to capture what has been going on at our current juncture, where ML has turned into something we a lot more justifiably can call: Artificial Intelligence.
I remember getting quizzical looks from colleagues interested in machine learning when I complained that you couldn't do general-purpose computation with a specific model. I think we were talking about word embeddings -- Word2Vec had just arrived, and the potential for operating on words in a continuous semantic space was immediately apparent. Without compositionality beyond simple concept arithmetic, though, a semantic space was merely interesting, useful perhaps for search, but nowhere near sufficient for computation in any meaningful sense.
A very well-organized dictionary is still just a dictionary.
A favorite interview question of mine back in the day -- which no interviewee ever answered satisfactorily, and which bedeviled me too -- was this: how do you cram a variable-size input into a fixed-size vector in a workable way, such that you can run it through a neural network without drowning in vanishing gradients? (The recurrent neural network was the non-solution of record.) Transformers answered the question decisively: you simply don't do that (by that I mean cram it all in a fixed-size vector). You maintain a representation that grows with the input, at the cost of quadratic computation on it.
With that, we were no longer beholden to the fact that while neural networks may be universal function approximators in theory, they were not, in their original form, designed to handle variable-size data with variable computational effort.
Or handle variable-size data at all.
Naive function approximation gives you a mapping, but ignoring the inherent scale of algebraic compositionality. Informally, computation is algebraic compositionality incarnate. The difference may seem pedantic right up until you need the latter.
Loops and compositionality
LLMs can obviously compute. Computation, understood 'properly', is modeled by following the rules of a formal language -- LLMs are simply extending this notion to instead follow the informal rules of an informal language. This may not meet the strict definitional requirements that logicians worked out roughly a hundred years ago, but it likely captures the spirit of what they were trying to define much better than we typically give it credit for.
Like the physical computers we are familiar with, language models are limited by finite context and memory. They are and remain imperfect models. (The interesting question -- as an aside -- is whether the imperfection lies in the theory or the implementation.)
Here is a fact about the history of computation worth holding onto: the most powerfu:l -- and indeed as I have argued before: excessively powerful -- thing we ever did was introduce arbitrary looping.
Primitive recursive functions are already remarkable -- you can compute a great deal within bounded recursion, with a termination guarantee on every call. But the decisive leap, the thing that opened up the full landscape of what is computable at all, was unbounded iteration: the mu-operator, the WHILE loop, the ability to say keep going until some condition is met and make no promises about when that will be.<br>The things we do around context compaction, subagents and discoverable context point in the same direction as the original argument for why we could treat practically finite computers as models for theoretically infinite machines: as long as you have a non-exponential algorithm, you can always add more -- more memory, more processing power and machines -- once you hit the limits of the original setup for a specific workload.
Now consider what a single forward pass through a language model actually is, computationally. A transformer with a fixed context window and finite-precision arithmetic computes a bounded function from an input sequence to an output distribution. The architecture is a circuit of fixed depth, not a machine with unbounded working memory. Each token attends to previous tokens, yes, but the computation terminates after a bounded number of operations determined by layer count, context window and implementation constraints.
This is not a criticism. The constraint is part of what makes training tractable. But it is a fact worth holding clearly when we ask what these systems can and cannot...