What Is a Transformer Model? Self-Attention Explained

The transformer is the architecture under nearly every modern large language model. Its defining move was to drop recurrence and convolution entirely and rely on an attention mechanism. Here is what that means, from the 2017 paper that named it.

Ask what a transformer is and you usually get the name of a model — GPT, BERT, the rest — when the word actually refers to an architecture: a particular way of wiring a neural network. Here is the question worth answering instead: what did the transformer do differently from the sequence models that came before it? The earlier dominant designs were recurrent or convolutional. A recurrent network reads a sequence one element at a time, carrying a running summary forward — token one, then token two informed by one, then token three informed by two, and so on. That left-to-right dependence is intuitive, but it is also a bottleneck: you cannot easily compute step five until you have computed step four, so the work is hard to parallelize. The transformer's defining move was to throw that out.

The reference is the 2017 paper "Attention Is All You Need" by Ashish Vaswani and colleagues. Its proposal is stated as plainly as the title:

"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely." (Attention Is All You Need, arXiv:1706.03762)

So what is the attention mechanism that the architecture rests "solely" on? For every position in the sequence, the model computes how much every position, itself included, should influence it, and forms that position's new representation as a weighted blend of them all, with the weights given by those learned relevance scores. When a model attends within a single sequence to build representations of that same sequence, the mechanism is called self-attention. The practical effect is that a token at the start of a sentence and a token at the end can interact directly, in one step, instead of having information passed hand-to-hand down a recurrent chain. The model does not have to walk the distance between two related words; it can look across it in a single operation.

Why removing recurrence was the point

The advantage the paper emphasizes is not only quality but parallelism. Because there is no step-by-step recurrence forcing position five to wait on position four, the computation for all positions can be done together. The authors report that their models are "more parallelizable and requiring significantly less time to train," and they back it with a concrete training figure: a new single-model state-of-the-art BLEU score of 41.8 on the WMT 2014 English-to-French task "after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature." Read that carefully: the headline is a translation benchmark, but the quiet claim is an economic one. The result is better accuracy than the previous best single models, reached for far less training compute, because the architecture maps cleanly onto hardware that does many things at once. GPUs reward parallel work, and the transformer is built to be parallel.
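The difference in dependency structure can be sketched with a toy example. This is not the paper's model: the scalar tanh update and the weights w_h and w_x are illustrative assumptions, and positionwise is a hypothetical stand-in for any computation whose outputs depend only on the inputs, never on other outputs.

```python
import math

def recurrent_states(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar recurrence: each state depends on the previous state."""
    h = 0.0
    states = []
    for x in inputs:
        # Step t cannot start until step t-1 has produced h.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def positionwise(inputs, w_x=1.0):
    """Toy map with no carried state: each output uses only its own input."""
    # No output feeds into another, so every position could be computed
    # at the same time; the list comprehension stands in for that.
    return [math.tanh(w_x * x) for x in inputs]

xs = [0.1, -0.3, 0.7, 0.2]
print(recurrent_states(xs))
print(positionwise(xs))
```

In the recurrent version the loop order is part of the result: reversing the input changes every state. In the second version the positions are independent pieces of work, which is the property that lets hardware process them together.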

That single property — parallelizability — is most of why the transformer, rather than its recurrent predecessors, became the substrate for large language models. Training a model on internet-scale text is feasible only if the work spreads efficiently across thousands of accelerators. An architecture whose core operation can be computed for the whole sequence at once is far better suited to that than one that is inherently sequential. The 2017 paper demonstrated the architecture on machine translation and constituency parsing; the field then discovered that the same design, scaled up and trained on enough text, generalizes to a startling range of language tasks. The architecture stayed; the scale exploded.

It is worth being concrete about how attention turns into a computation, because the abstraction hides a simple shape. For each position, the model produces three vectors, conventionally called a query, a key, and a value. The relevance of position B to position A is computed by comparing A's query with B's key; in the paper this comparison is a dot product divided by the square root of the key dimension. Those scores are normalized with a softmax into weights that sum to one, and A's new representation is the sum of every position's value, weighted accordingly. Done for all positions at once, this is a set of matrix multiplications, exactly the operation accelerators are built to do fast and in bulk. The paper layers this into "multi-head" attention, running several such comparisons in parallel so the model can attend to different kinds of relationships simultaneously. The mechanism, in other words, is not exotic math; it is a lot of ordinary linear algebra arranged so the hardware can chew through it in parallel.
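The query, key, and value steps above can be sketched in a few lines. This is a minimal single-head illustration in plain Python, assuming the scaled dot-product form from the 2017 paper, softmax(QK^T / sqrt(d_k)) V. The helper names (matmul, softmax, self_attention), the toy input, and the identity projection matrices are hypothetical choices for illustration; a real model would use learned weights and an array library rather than nested lists.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q = matmul(x, w_q)  # one query vector per position
    k = matmul(x, w_k)  # one key vector per position
    v = matmul(x, w_v)  # one value vector per position
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of the key matrix
    # scores[i][j] compares position i's query with position j's key.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # Row i of weights says how much position i draws from each position.
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted sum of all value vectors.
    return matmul(weights, v), weights

# Hypothetical toy input: 3 positions with 2 features each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
out, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Every step that touches the whole sequence is a call to matmul, which is the sense in which attention reduces to matrix multiplication. Multi-head attention, as described in the paper, would run several such heads with separate projections and combine their outputs; that is left out here to keep the sketch short.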

What the name does and doesn't tell you

A few clarifications keep the concept honest. First, "transformer" is the architecture, not a brand: the many named models built on it differ in size, training data, and which half of the original encoder-decoder design they use, but they share the attention-based core. Second, self-attention's reach comes at a cost that the original paper's successors spend enormous effort managing: the number of comparisons grows quadratically with sequence length, because every position is compared with every other, which is why long-context efficiency is an active engineering problem rather than a solved one. Third, attention is a mechanism for relating elements of a sequence. It is not, by itself, understanding; what a model does with those relationships is learned during training, not conferred by the architecture.
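The quadratic growth is easy to see with arithmetic. The sketch below counts query-key comparisons for a single attention head in a single layer, assuming full (unmasked) attention; the function name attention_score_count and the sequence lengths are hypothetical.

```python
def attention_score_count(seq_len):
    """Query-key comparisons for one head in one layer of full attention."""
    return seq_len * seq_len

for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))
```

Doubling the sequence length quadruples the number of comparisons, so context length cannot be extended cheaply by simply feeding in more tokens.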

Strip the mystique and the transformer is a precise engineering answer to a specific limitation. Recurrent models passed information down a chain and paid for it in serial computation. The transformer replaced the chain with attention, letting every part of a sequence relate to every other part directly and in parallel — and in doing so made it practical to train models on far more data than before. The 2017 paper's title was a thesis, and the decade since has been the field acting on it: attention, it turns out, really was most of what these models needed.
