The Transformer & Attention: The Architecture That Lets Every Word See Every Other Word

TL;DR — Scope

This note covers the architecture itself: how the model processes the embedding vectors from Foundation 01. The transformer is the structure; attention is its core operation — letting every token look at every other token and pull in what's relevant. It spans self-attention (query/key/value), multi-head attention, the transformer block (attention + feed-forward + residual + normalization), how blocks stack, positional encoding, and the masked, decoder-only design modern LLMs use. It excludes how the model is trained (Foundation 03) and how text became vectors in the first place (Foundation 01).

The Problem

Before transformers, the best sequence models were recurrent networks: RNNs and LSTMs. They read a sentence the way you might read aloud: one token at a time, left to right, carrying a running memory of what came before.^[1] That design has two crippling limits. It can't be parallelized across a sequence, because each step waits on the previous one, so training is slow.^[10] And long-range dependencies fade: by the time the model is hundreds of tokens deep, a detail from the start has washed out of the running memory, and in training the learning signal that should reach it shrinks too (the vanishing-gradient problem).^[10]

The 2017 paper “Attention Is All You Need” replaced recurrence entirely with self-attention, letting every token look at every other token directly and in parallel.^[1]^[2] The result was a massive training speedup and a far better grip on long-range relationships. The price — which the later foundations return to — is that word order is no longer built in, and compute grows with the square of the sequence length.

FIG. 1 — The breakthrough. An RNN passes a fading memory forward one step at a time; the transformer wires every token directly to every other and processes them all in parallel — faster to train and far better at long-range links.

The Concepts

Self-Attention & Query/Key/Value

This is the core operation. Each token's embedding is transformed, through three learned weight matrices, into three vectors: a Query (what am I looking for), a Key (what do I offer), and a Value (the content I'd contribute).^[4]^[3] To update a token, the model compares its query against every token's key (a dot product) to get similarity scores, scales the scores and passes them through a softmax so they become positive weights that sum to 1, and produces a weighted blend of all the values.^[3] The token's new representation is therefore a context-aware mixture: it has pulled in information from the tokens most relevant to it.

FIG. 2 — Self-attention. The token “it” asks (its query) which other tokens matter; the strongest match (“cat”) dominates the weighted blend of values, so “it” comes out of the layer carrying “cat’s” meaning. Illustrative weights.
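The query/key/value steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names (`softmax`, `matvec`, `self_attention`), the toy embeddings, and the identity weight matrices are all assumptions made for the example, and the division by the square root of the key dimension follows the scaled dot-product form of the original paper.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vec, matrix):
    """Row vector (d_in) times matrix (d_in x d_out)."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

def self_attention(embeddings, w_q, w_k, w_v):
    """Single-head self-attention over a list of token vectors."""
    queries = [matvec(x, w_q) for x in embeddings]
    keys = [matvec(x, w_k) for x in embeddings]
    values = [matvec(x, w_v) for x in embeddings]
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's query against every token's key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted blend of all the values.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy input: three tokens with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weight matrices
outputs, weights = self_attention(embeddings, identity, identity, identity)
```

In a trained model the three matrices are learned rather than fixed, and all tokens are handled as one batched matrix product; the explicit loop is only there to make each step visible.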

Multi-Head Attention

A single attention operation yields one weighted average, which blurs many kinds of relationship together.^[6] So transformers run several attention operations in parallel, called heads, each with its own query/key/value matrices and each free to focus on a different kind of relationship (one head might track grammar, another a long-range reference).^[5]^[6] Their outputs are concatenated and passed through one more learned projection to form the token's final representation.^[4]

FIG. 3 — Multi-head attention. The same token is examined by several heads at once, each attending to a different kind of relationship; their views are merged into one richer representation.
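As a rough sketch of how several heads can be run and merged, the example below gives each head its own small query/key/value matrices, concatenates the per-head outputs, and applies an output projection. The names (`attention_head`, `multi_head_attention`, `w_o`) and the toy weights are assumptions for illustration; the concatenate-then-project step follows the original paper's formulation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vec, matrix):
    """Row vector (d_in) times matrix (d_in x d_out)."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

def attention_head(embeddings, w_q, w_k, w_v):
    """One head: project to Q/K/V, score, softmax, blend values."""
    qs = [matvec(x, w_q) for x in embeddings]
    ks = [matvec(x, w_k) for x in embeddings]
    vs = [matvec(x, w_v) for x in embeddings]
    scale = math.sqrt(len(ks[0]))
    out = []
    for q in qs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in ks])
        out.append([sum(w * v[j] for w, v in zip(weights, vs))
                    for j in range(len(vs[0]))])
    return out

def multi_head_attention(embeddings, heads, w_o):
    """heads is a list of (w_q, w_k, w_v) tuples, one per head."""
    per_head = [attention_head(embeddings, *h) for h in heads]
    combined = []
    for t in range(len(embeddings)):
        # Concatenate every head's view of token t, then project.
        concat = [x for head_out in per_head for x in head_out[t]]
        combined.append(matvec(concat, w_o))
    return combined

# Hypothetical toy setup: 2-dim embeddings, two heads of width 1.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head_a = ([[1.0], [0.0]],) * 3  # this head only looks at feature 0
head_b = ([[0.0], [1.0]],) * 3  # this head only looks at feature 1
w_o = [[1.0, 0.0], [0.0, 1.0]]  # identity stand-in for the output projection
out = multi_head_attention(embeddings, [head_a, head_b], w_o)
```

Because each head works in a narrower subspace, the total cost stays close to that of one full-width attention operation, which is the usual motivation for splitting the width across heads.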

The Transformer Block

Attention is one piece of a repeating unit. Each block pairs multi-head self-attention with a feed-forward network — a small two-layer network applied at each position that adds non-linear processing — and wraps both in “Add & Norm”: a residual connection (adding a sub-layer's input back to its output, for smooth gradient flow) followed by layer normalization (keeping values on a stable scale).^[5]^[9]

FIG. 4 — The transformer block. Attention mixes information across tokens; the feed-forward network processes each position; residual connections (dashed) and normalization keep training stable. This unit is stacked N times — six in the original paper, dozens to hundreds in modern LLMs.^[9]
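The block structure can be sketched as two sub-layers, each wrapped in "Add & Norm". The code below is a simplified illustration under several assumptions: attention uses the token vectors directly as queries, keys, and values (no learned projections), layer normalization omits its learned gain and bias, the feed-forward network uses ReLU as in the original paper, and all names and random weights are hypothetical. It follows the residual-then-normalize order described above; many modern LLMs normalize before each sub-layer instead.

```python
import math
import random

def layer_norm(vec, eps=1e-5):
    """Rescale one token vector to zero mean and roughly unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(sublayer_input, sublayer_output)])

def simple_attention(tokens):
    """Simplified attention: tokens act as their own Q, K and V."""
    scale = math.sqrt(len(tokens[0]))
    mixed = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        mixed.append([sum(e / total * v[j] for e, v in zip(exps, tokens))
                      for j in range(len(q))])
    return mixed

def feed_forward(vec, w1, b1, w2, b2):
    """Two-layer network applied to one position at a time."""
    hidden = [max(0.0, sum(v * row[j] for v, row in zip(vec, w1)) + b1[j])
              for j in range(len(b1))]
    return [sum(h * row[j] for h, row in zip(hidden, w2)) + b2[j]
            for j in range(len(b2))]

def transformer_block(tokens, ffn_params):
    attended = simple_attention(tokens)                      # mix across tokens
    tokens = [add_and_norm(x, a) for x, a in zip(tokens, attended)]
    processed = [feed_forward(x, *ffn_params) for x in tokens]  # per position
    return [add_and_norm(x, f) for x, f in zip(tokens, processed)]

# Hypothetical toy sizes and random weights.
rng = random.Random(0)
d_model, d_ff = 4, 8
w1 = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
w2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
ffn_params = (w1, [0.0] * d_ff, w2, [0.0] * d_model)
tokens = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]
out = transformer_block(tokens, ffn_params)
```

The output has the same shape as the input, which is what allows the same unit to be stacked N times.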

Stacking & Positional Encoding

The block repeats: each layer refines the representations a little more, building from surface patterns toward abstract meaning as data flows up the stack.^[10] But because attention processes all tokens at once with no recurrence, the architecture has no inherent sense of order: without extra information, “dog bites man” and “man bites dog” would look identical.^[8] The fix is to inject each token's position. The original added positional encodings built from sine and cosine waves to the embeddings; modern models often use schemes like RoPE, which instead rotates the query and key vectors inside attention.^[7]^[8]
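To make the ordering problem concrete, the sketch below computes the sine/cosine encoding of the original paper and adds it to some toy embeddings. The function names and the one-hot toy vectors are assumptions for illustration; RoPE works differently and is not shown.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sine/cosine encoding for one position (original-paper formula)."""
    encoding = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so the exponent is i / d_model.
        angle = pos / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension
        encoding.append(math.cos(angle))  # odd dimension
    return encoding[:d_model]

def add_positions(embeddings):
    """Add each token's positional encoding to its embedding."""
    return [
        [e + p for e, p in zip(emb, sinusoidal_position(pos, len(emb)))]
        for pos, emb in enumerate(embeddings)
    ]

# Hypothetical toy embeddings, one vector per word.
toy = {
    "dog": [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man": [0.0, 0.0, 1.0, 0.0],
}
sentence_a = [toy[w] for w in ["dog", "bites", "man"]]
sentence_b = [toy[w] for w in ["man", "bites", "dog"]]

# Without positions, both sentences are the same collection of vectors.
same_without_positions = sorted(sentence_a) == sorted(sentence_b)
# With positions added, the two collections differ.
same_with_positions = sorted(add_positions(sentence_a)) == sorted(add_positions(sentence_b))
```

Comparing the sorted lists is a stand-in for "order does not matter": it shows that, before positions are added, nothing distinguishes the two sentences except the arrangement of the same three vectors.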

Masked Attention & Decoder-Only

The original transformer had both an encoder and a decoder. Most modern LLMs are decoder-only stacks that use masked (causal) self-attention: a token may attend only to itself and the tokens before it, never the ones ahead.^[5]^[9] That mask is what lets the model be trained to predict the next token without cheating by peeking at the answer.

FIG. 5 — The causal mask. Each token (row) may attend only to itself and earlier tokens (filled), never to future ones (blank). This triangular pattern is what makes next-token prediction honest.
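The triangular pattern can be built and applied in a few lines. In this sketch (names are hypothetical), disallowed positions are set to negative infinity before the softmax, so they receive exactly zero weight; this is one common way to apply a mask, assumed here for illustration.

```python
import math

def causal_mask(n):
    """Row t is True for columns 0..t (itself and earlier tokens)."""
    return [[col <= row for col in range(n)] for row in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; the rest get weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because a token may always attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# With equal raw scores, each row spreads its weight evenly
# over the tokens it is allowed to see.
mask = causal_mask(3)
weights = [masked_softmax([0.0, 0.0, 0.0], row) for row in mask]
```

Here the first token can only attend to itself, the second splits its weight between the first two positions, and the third spreads it over all three.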

How It All Fits Together

End to end, the architecture is a tall stack of structurally identical blocks, each with its own learned weights. Embeddings (with positions added) flow in; each block lets tokens attend to one another and then processes them position by position; after many layers, the final representations are used to predict the next token.

Embeddings + positions (vectors from Foundation 01) → × N Transformer blocks (attention + feed-forward) → Out: Representations (context-aware vectors) → Use: Next token (predicted from the top)

FIG. 6 — The data flow. The transformer consumes the embeddings from Foundation 01, refines them through a stack of attention-plus-feed-forward blocks, and hands the top representations to next-token prediction.
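The whole data flow can be sketched end to end in plain Python. This is a toy under stated assumptions, not a real model: the weights are random rather than trained, each block uses a single attention head, biases and layer-norm parameters are omitted, and the final step (a projection `w_out` to vocabulary scores followed by a softmax) is an assumed but conventional way to turn the top representation into next-token probabilities. All names are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vec, matrix):
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def causal_attention(tokens, w_q, w_k, w_v):
    qs = [matvec(x, w_q) for x in tokens]
    ks = [matvec(x, w_k) for x in tokens]
    vs = [matvec(x, w_v) for x in tokens]
    scale = math.sqrt(len(ks[0]))
    out = []
    for t, q in enumerate(qs):
        # Causal mask: position t only sees positions 0..t.
        visible_keys, visible_values = ks[:t + 1], vs[:t + 1]
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale
                           for k in visible_keys])
        out.append([sum(w * v[j] for w, v in zip(weights, visible_values))
                    for j in range(len(vs[0]))])
    return out

def block(tokens, p):
    """Attention, then feed-forward, each wrapped in Add & Norm."""
    attended = causal_attention(tokens, p["w_q"], p["w_k"], p["w_v"])
    tokens = [layer_norm([a + b for a, b in zip(x, y)])
              for x, y in zip(tokens, attended)]
    ffn = [matvec([max(0.0, h) for h in matvec(x, p["w_1"])], p["w_2"])
           for x in tokens]
    return [layer_norm([a + b for a, b in zip(x, y)])
            for x, y in zip(tokens, ffn)]

def next_token_probs(embedded, blocks, w_out):
    tokens = embedded
    for p in blocks:  # same structure, different weights per layer
        tokens = block(tokens, p)
    return softmax(matvec(tokens[-1], w_out))  # predict from the last position

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes; `embedded` stands in for embeddings + positions.
rng = random.Random(0)
d_model, d_ff, vocab_size, n_layers = 4, 8, 5, 2
blocks = [
    {
        "w_q": random_matrix(rng, d_model, d_model),
        "w_k": random_matrix(rng, d_model, d_model),
        "w_v": random_matrix(rng, d_model, d_model),
        "w_1": random_matrix(rng, d_model, d_ff),
        "w_2": random_matrix(rng, d_ff, d_model),
    }
    for _ in range(n_layers)
]
w_out = random_matrix(rng, d_model, vocab_size)
embedded = random_matrix(rng, 3, d_model)
probs = next_token_probs(embedded, blocks, w_out)
```

With untrained weights the resulting distribution is meaningless; the point is only that vectors go in, pass through N blocks of the same shape, and a probability for each vocabulary entry comes out.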

The throughline: attention lets every token gather context from every other token, multi-head runs many such views at once, and the block-plus-stack structure turns that into deep, abstract understanding — with position bolted on and a causal mask keeping generation honest. This is the engine that consumes the vectors of Foundation 01 (it never sees letters) and that the training of Foundation 03 will shape. Its defining move — every token attending to every other — is precisely why cost grows quadratically with length, the root of the long-context expense in Foundation 06 and the KV-cache bottleneck you met in serving. And the contextual representations it produces are, ultimately, what every Tier-2 discipline runs on.

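① Attention is quadratic. Because every token attends to every other, attention compute scales with the square of the sequence length, and so does memory whenever the full score matrix is stored. That is why long context is expensive, and it is a major target of ongoing research (FlashAttention, sparse attention, linear-attention variants). The KV cache that balloons in serving is a related cost: it grows with every token of context the model must keep available to attend to.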

② The model has no innate sense of order. Position is added after the fact, not intrinsic to the architecture, which is why positional schemes are a live research area and why very long or unusually structured sequences can still confuse a model about what came where.
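The quadratic growth in the first point is easy to check with arithmetic. The helper below simply counts query-key comparisons per attention operation; the function name and the chosen sequence lengths are illustrative assumptions, and real cost also depends on model width, number of heads, and number of layers.

```python
def attention_pairs(seq_len, causal=False):
    """Number of query-key comparisons in one attention operation."""
    if causal:
        # Each token sees itself and everything before it: 1 + 2 + ... + n.
        return seq_len * (seq_len + 1) // 2
    return seq_len * seq_len

# Doubling the sequence length roughly quadruples the work.
for n in (1_000, 2_000, 4_000, 8_000):
    print(n, attention_pairs(n), attention_pairs(n, causal=True))
```

The causal mask roughly halves the count but does not change the growth rate: the number of comparisons still rises with the square of the length.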
