Tree Transformers

Tree Transformers - by astle dsa - Theoretical Limits

Theoretical Limits

SubscribeSign in

Tree Transformers A step towards generalizing the transformer architecture

astle dsa Jun 20, 2026

After a billion architectures and a trillion variations, I finally found a transformer architecture that intrigued me. And this essay is step one towards the theory and the sub-field on which it was built1. I.

The representation of any dataset that we wish to approximate/predict/learn the distribution of, has converged to be vectors. This phenomena was not by accident, as vectors provide higher dimensional representations of the data, which is needed for mapping connections between data points, and also plays right into the strengths of a GPU, giving us incredibly parallelizable algorithms for deep learning. While these points are very true and have bore incredible fruits, we can always benefit by probing further. Every dataset has a unique internal structure2 which our models must approximate, in order to learn and predict. This structure, in the process of transforming it into a vector representation is lost (or, is very costly to preserve).3 What happens if, instead of transforming the dataset into a vector representation, we instead try to preserve what structure we have, and generalize the transformer in order to perform the operation on that structure itself, instead of vectors ? II.

As the title suggests, the first data structure I chose to explore was the ubiquitous tree structure4 . Trees appear more often then any other structure in computer science (except maybe graphs), and hence I chose to implement the self-attention mechanism, but for trees, before anything else. We first begin by defining a Node data structure, like so,

but with a crucial difference: it’s stored value can be another node, recursively. This structure itself becomes the essence of the deep network structures we care about: as matrix is a vector of vectors, so can this structure give us a “tree of trees“. This is essential while defining the very important outer-product or matrix multiplication equivalent for the tree data structure. The next step would be to define three very important functions which we will use in order to construct the self-attention mechanism. The functions are5: Pair : this function answers the question, how do we define an operation over two tree structures, i.e, how do we add/subtract/multiply/divide two trees ?

Counit : this function is analogous to the reduce function, or, a meaningful way to transform the tree data structure into a single scalar. A reduction operation.

Map : a very well known function. Basically given a function and a data structure (where the structure is strictly a collection of values), the map function applies the function to the entire collection.

Defining these three function for any data structure should essentially unlock the self-attention mechanism for that structure, hence we shall start with these three first, but for our Tree structure. Starting off with the Pair function, it should be a simple traversal and adding up the values of two trees:

The last node, n3 is a Tree of the same shape as n1 and n2, just initialized to be empty. Next is the Map function, which is extremely straightforward:

Kindly ignore the type warnings :) Now, before we write the last Counit function, we must define a simple Tree structure, which helps in storing the reduced value in it’s state.

Equipped with these three functions, we are all set for defining our own self-attention mechanism, on our way to write the tree-transformer ! III.

Now we must focus on the very basics of what operations go into creating a simple neural network (say a multi-layered perceptron) and build up to the operations required for self-attention, and see how they can be defined when the underlying data structure is not a vector, but a tree. First, we start with defining the dot product for the tree structure. The most important things that we have to keep in mind are that the transformation preserves the shape of the input vector, and that the weight vector here is a “vector of vectors“ also called a matrix. Each column in the matrix undergoes a reduce operation, before being multiplied by the input’s columnar value. This principle is to be followed in our tree structure as well: first we reduce (using the Counit ) the inner tree of the weights (remember, our weight tree is a tree of trees) and we then proceed to multiply it with the input tree’s node.

A simple dot product. Note that the input’s shape is preserved.

A dot product, but for the Tree data structure. Implementation:

With the dot product out of the way, we are getting ever closer to defining the self-attention function. We simply need one last function, which would help us create a tree of trees, that is, the outer product. The procedure is very simple: given two trees, we simply need to “insert“ the second tree, into each node of the first tree,...

Tree Transformers

Related Articles

Apple WWDC 2026 Livestream

Claude Fable 5

US Government directive to suspend access to Fable 5 and Mythos 5

Is AI ruining our skills? Early results are in – and they're not good

German ruling declares Google liable for false answers in AI Overviews