Transformers Explained for Software Engineers


Transformers and Attention: A Plain-English Guide for Engineers | Bharadwaj P

If you build software today, you almost certainly use models built on the transformer architecture, often as a black box. This is a look inside, no heavy math required.

Pictures over equations, small numbers over notation. (Written while working through the Deep Atlas program.)

TLDR

Each word becomes a row of numbers.

The model moves and bends those numbers, step by step, to expose useful patterns.

Attention lets every word adjust its meaning based on the other words around it.

The final numbers become a probability for every possible next word.

Pick a word, add it to the sentence, repeat.

Everything below is just zooming into one of those five steps. We will use a single running example the whole way through: the prompt "this too shall", which we are hoping the model completes with "pass".
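To make the last three steps concrete, here is a minimal sketch of the generate loop. Everything in it is an assumption for illustration: `VOCAB` is a made-up five-word vocabulary, and `toy_scores` is a hypothetical stand-in for a trained model that returns hand-written scores. Only the shape of the loop (score, turn into probabilities, pick, append, repeat) reflects the steps above.

```python
import math

# Made-up vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["this", "too", "shall", "pass", "<end>"]

def toy_scores(words):
    """Hypothetical stand-in for the model: one raw score per vocabulary word."""
    if words[-1] == "shall":
        return [0.1, 0.2, 0.1, 3.0, 0.5]   # hand-written to favor "pass"
    return [0.1, 0.1, 0.1, 0.1, 3.0]       # otherwise favor "<end>"

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(scores)                   # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_new_words=5):
    words = prompt.split()
    for _ in range(max_new_words):
        probs = softmax(toy_scores(words))
        next_word = VOCAB[probs.index(max(probs))]  # greedy: take the most likely word
        if next_word == "<end>":
            break
        words.append(next_word)
    return " ".join(words)

result = generate("this too shall")
print(result)  # this too shall pass
```

Picking the single most likely word is only one option. Real systems often sample from the probabilities instead.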

First, words become numbers

Computers work on numbers, not words. So the first thing any language model does is turn each word into a row of numbers, called a vector. Stack those rows and you get a grid of numbers, which in ML we call a tensor. (Strictly, models split text into tokens, which are often subword pieces rather than whole words. I'll say "word" throughout for readability; swap in "token" wherever you want precision.)

These are ordinary decimal numbers, like 0.2 or -1.3, not the binary 0s and 1s the machine stores underneath. A word like "this" gets assigned its own row of decimals, and all the math from here runs on those.

Take "this too shall". Three words, so three rows. Say four numbers per word. That gives a tensor of shape 3×4, meaning three rows and four columns. The four is arbitrary here; real models use hundreds or thousands of numbers per word.
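As a rough sketch, the lookup can be pictured as a dictionary from word to row. The numbers in `EMBEDDINGS` are invented for illustration. In a real model they are learned, and the table is indexed by token ID rather than by string.

```python
# Made-up embedding table: one row of four decimals per word.
EMBEDDINGS = {
    "this":  [0.2, -1.3, 0.7, 0.1],
    "too":   [-0.5, 0.9, 0.3, -0.8],
    "shall": [1.1, 0.4, -0.6, 0.2],
}

def embed(prompt):
    """Look up one row per word and stack the rows into a grid."""
    return [EMBEDDINGS[word] for word in prompt.split()]

tensor = embed("this too shall")
shape = (len(tensor), len(tensor[0]))
print(shape)  # (3, 4)
```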

Before the model runs its layers, those numbers are close to a raw lookup with no sense of context. The model's entire job, layer by layer, is to reshape them until the last row is a good answer to the question "what word comes next?" It helps to picture each row as a single point sitting in space. The rest of this guide is about how those points get moved.

A layer just moves points

A neural network is built from layers, and each layer does something plainer than it sounds: it takes a set of points and moves them to new positions.

Multiply your points by a grid of numbers (a weight matrix) and they land somewhere new. That operation is a linear projection. It has one important limit: it keeps straight lines straight. It can stretch, rotate, and shear the whole space, but it cannot bend it. Lines that were parallel stay parallel, and every point is moved by the same rule.

Two things worth heading off here. Nothing gets cherry-picked: the same matrix applies to every point at once. And the numbers inside that matrix are not chosen by hand. They start random and get learned during training, nudged over and over toward values that move points somewhere useful (more on how, later).
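Here is a small sketch of that move in plain Python. The `points` and `weights` values are hand-picked assumptions so the arithmetic is easy to follow. In a trained model the weights would be learned. The helper `matmul` is written out only to keep the example dependency-free.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Three points in 2D, one per row.
points = [[1.0, 2.0], [3.0, -1.0], [0.0, 0.5]]

# One weight matrix, applied to every point at once (illustrative values).
weights = [[2.0, 0.0],
           [1.0, 1.0]]

moved = matmul(points, weights)
print(moved)  # [[4.0, 2.0], [5.0, -1.0], [0.5, 0.5]]
```

Each output row depends only on the matching input row and the shared matrix, which is what "nothing gets cherry-picked" means in practice.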

There is a catch worth knowing early. If you only ever stack linear projections, stacking buys you nothing. Two of them, or ten, collapse into a single equivalent projection. Matrix multiplication is associative, so moving points by one matrix and then another gives the same result as moving them once by the product of the two. So a "deep" pile of pure linear layers is no deeper than one layer. It is one move wearing a trench coat.
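The collapse is easy to check by hand. In this sketch, `w1` and `w2` are arbitrary example matrices (assumptions, not values from any real model). Applying them one after the other gives the same result as applying their product once.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

points = [[1.0, 2.0], [3.0, -1.0]]
w1 = [[2.0, 0.0], [1.0, 1.0]]    # first "layer" (illustrative values)
w2 = [[0.0, 1.0], [-1.0, 3.0]]   # second "layer" (illustrative values)

# Two linear layers, one after the other.
two_steps = matmul(matmul(points, w1), w2)

# Pre-multiply the two matrices, then apply the result once.
combined = matmul(w1, w2)
one_step = matmul(points, combined)

print(two_steps)  # [[-2.0, 10.0], [1.0, 2.0]]
print(one_step)   # [[-2.0, 10.0], [1.0, 2.0]]
```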

The bend that makes depth worth it

To make depth pay off, you add a nonlinearity between the linear moves. The classic one is ReLU, short for rectified linear unit, and it is about as simple as a rule gets: keep positive numbers, turn negatives into zero. No calculus needed to read it:

ReLU(2.3) = 2.3 → positive, left alone
ReLU(-1.7) = 0 → negative, snapped to zero
ReLU(0.4) = 0.4 → positive, left alone
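In code the rule is a one-liner. This is a plain-Python sketch. ML libraries ship their own vectorized versions.

```python
def relu(x):
    """Keep positive numbers, turn negatives into zero."""
    return max(0.0, x)

row = [2.3, -1.7, 0.4]
bent = [relu(value) for value in row]
print(bent)  # [2.3, 0.0, 0.4]
```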

A good way to hold the whole pattern is three words: project, warp, project. The linear step slides points in straight lines. The nonlinearity bends the space, changing the distances between points in ways straight lines never could. That bend is the entire reason a deep network can learn things a shallow one cannot.
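A minimal sketch of project, warp, project, using hand-picked matrices `w1` and `w2` as assumptions. It runs the same two projections with and without the ReLU in the middle, so the effect of the bend shows up in the output.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu_all(matrix):
    """Apply ReLU to every number in the grid."""
    return [[max(0.0, value) for value in row] for row in matrix]

points = [[1.0, 2.0], [3.0, -1.0]]
w1 = [[2.0, 0.0], [1.0, 1.0]]    # first projection (illustrative values)
w2 = [[0.0, 1.0], [-1.0, 3.0]]   # second projection (illustrative values)

hidden = matmul(points, w1)                 # project
with_bend = matmul(relu_all(hidden), w2)    # warp, then project
without_bend = matmul(hidden, w2)           # no warp: still one linear move

print(hidden)        # [[4.0, 2.0], [5.0, -1.0]]
print(with_bend)     # [[-2.0, 10.0], [0.0, 5.0]]
print(without_bend)  # [[-2.0, 10.0], [1.0, 2.0]]
```

The first point never goes negative after the first projection, so ReLU leaves it alone. The second point has its -1.0 snapped to zero, and it ends up somewhere no single matrix built from `w1` and `w2` would have put it.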

Figure: (left) 1 · Linear move (straight, parallel lines); (right) 2 · Warp (ReLU snaps negatives to 0).

The linear step moves every point by the same straight-line rule (left). The nonlinearity then bends the space: ReLU pushes anything that fell below zero back onto the zero line (right). Without that bend, ten moves equal one move.

Here is why the bend earns its place. Without it, a layer can only re-describe the data, never genuinely reshape it. You get a different view of the same information. With it, the model can pull two points apart, push two others together, and actually change what the data means for the next layer. That is the gap between rearranging and learning.
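A one-dimensional sketch of "push two points together", with numbers chosen purely for illustration. A linear map on a line multiplies by a weight `w`, so it scales every gap by the same factor. ReLU can close one gap completely while leaving another untouched.

```python
def relu(x):
    return max(0.0, x)

def gap(a, b):
    """Distance between two points on a line."""
    return abs(a - b)

# Two pairs of points, each pair exactly 1.0 apart.
neg_a, neg_b = -2.0, -1.0
pos_a, pos_b = 1.0, 2.0

w = 3.0  # an arbitrary linear map on a line: multiply by w
linear_gaps = (gap(w * neg_a, w * neg_b), gap(w * pos_a, w * pos_b))
relu_gaps = (gap(relu(neg_a), relu(neg_b)), gap(relu(pos_a), relu(pos_b)))

print(linear_gaps)  # (3.0, 3.0)  both gaps scaled the same way
print(relu_gaps)    # (0.0, 1.0)  one pair merged, the other untouched
```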

Why ReLU and not something else? Nothing sacred about it. It zeros negatives, which is easy to picture and very cheap to compute, so it scales well. Other choices behave differently: GELU smooths the hard corner instead of snapping straight to zero, and gated variants like SwiGLU go...
