Are Transformers Turing-complete? A Good Disguise Is All You Need.
Skip to content
I will avoid an introduction and get straight to it. I’ll assume that you know what Turing-complete means, that you are familiar with transformers, and that you care about the question: Are transformers Turing-complete? (I may later write about what that question means and why it is important.)
Several papers claim that transformers are Turing-complete and therefore capable of computing any computable function. But I believe these claims are all misleading. Some make conceptual errors and are simply incorrect in their claims of Turing-completeness for their models. Others have modified the essential properties of transformers to achieve universal computation, presenting models that only resemble transformers superficially. As of October 2024, the deep learning model widely known as a ‘transformer’ (despite its success in commercial large language models) has not been shown to be Turing-complete (barring add-ons and modifications to its essential properties).
CORRECTION NOTE: In an earlier version of this post (April 22, 2023), I incorrectly stated that it is "easy to prove that transformers aren’t Turing-complete" because "any transformers can be simulated by a finite-state-machine". That statement is false. My mistake was to assume that the maximum context window length was a limitation inherent to a specific transformer model. But I learned that there really is no maximum context window. The "width" of the model (usually "d_model") has nothing to do with the number of tokens that a transformer can process. (I was fooled by how everyone calls the transformer a feed-forward neural network. But the attention layer in transformers isn’t quite a feed-forward neural network; it cannot be modeled with a fixed number of neurons and connections). A transformer can, in principle, process sequences of arbitrary length. It may not perform well beyond a certain length, but that’s a separate discussion about training and generalization. So you cannot simulate transformers with finite-state-machines. And since the model can store information in its growing sequence of tokens, this leaves open the possibility that transformers are, in fact, Turing-complete. I want to thank Nishanth Dikkala and Kevin Miller for bringing this mistake to my attention. In light of this realization, I have revised some of the critiques I brought up of specific papers.
Below, we’ll closely examine each paper that implies, or can be misinterpreted as claiming, that transformers are universal computers. In doing so, we will learn about the inner-workings of a transformer and understand what obstacles need to be overcome to create a machine learning model that learns functions within the scope of all computable functions.
Before we dive in, let me clear something up. My critiques below are not about hardware limitations or implementation. Some models rely on numerical values with arbitrary precision to ‘attend’ to tokens in the sequence. That’s fine. It’s common to use 64-bit floating point precision types, which are incapable of representing arbitrary precision values, but that’s an implementation choice separate from the abstract model of a transformer. In the same way that you can switch to a computer with more memory, you can always switch to higher fixed-precision to run a transformer on something that needs that extra boost to execute properly. You can even abandon the floating-point format altogether and store numbers as strings representing the numerator and denominator. As long as you don’t rely on infinite-precision, arbitrary finite precision is easily implementable in a digital representation scheme.
From the sixteen papers examined below, five incorrectly claim that their model is Turing-complete [1, 5, 10, 13, 14], eight present Turing-complete models which resemble transformers but operate in critically different ways, sometimes making them irrelevant to deep learning [1, 2, 3, 4, 8, 11, 12, 16], and three papers do not claim Turing-completeness but risk being misinterpreted as such [6, 9, 15]. (Categories are not mutually exclusive, and paper [7] does not even present a model).
In chronological order:
[1] Universal Transformers (2018)
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser
The authors provide an informal proof. "Given sufficient memory the Universal Transformer is computationally universal – i.e. it belongs to the class of models that can be used to simulate any Turing machine,… To show this, we can reduce a Neural GPU to a Universal Transformer. Ignoring the decoder and parameterizing the self-attention module, i.e. self-attention with the residual connection, to be the identity function, we assume the transition function to be a convolution. If we now set the total number of recurrent steps T to be equal to the input length, we obtain exactly a Neural GPU. Note that the last step is where the...