This note covers the architecture itself: how the model processes the embedding vectors from Foundation 01. The transformer is the structure; attention is its core operation — letting every token look at every other token and pull in what's relevant. It spans self-attention (query/key/value), multi-head attention, the transformer block (attention + feed-forward + residual + normalization), how blocks stack, positional encoding, and the masked, decoder-only design modern LLMs use. It excludes how the model is trained (Foundation 03) and how text became vectors in the first place (Foundation 01).
Before transformers, the best sequence models were recurrent networks — RNNs and LSTMs. They read a sentence the way you might read aloud: one token at a time, left to right, carrying a running memory of what came before.[1] That design has two crippling limits. It can't be parallelized — each step waits on the previous one, so training is slow.[10] And long-range dependencies fade: by the time the model is hundreds of tokens deep, a detail from the start has washed out (the vanishing-gradient problem).[10]
The 2017 paper “Attention Is All You Need” replaced recurrence entirely with self-attention, letting every token look at every other token directly and in parallel.[1][2] The result was a large training speedup on parallel hardware and a far better grip on long-range relationships. The price, which the later foundations return to, is that word order is no longer built in and compute grows with the square of the sequence length.
This is the core operation. Each token's embedding is transformed, through three learned weight matrices, into three vectors: a Query (what am I looking for), a Key (what do I offer), and a Value (the content I'd contribute).[4][3] To update a token, the model compares its query against every token's key to get similarity scores (scaled dot products), normalizes those scores with a softmax into weights that sum to 1, and produces a weighted blend of all the values.[3] The token's new representation is therefore a context-aware mixture: it has pulled in information from the tokens most relevant to it.
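The query/key/value steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the softmax and the division by the square root of the key dimension follow the standard formulation from “Attention Is All You Need”, while the function names, toy sizes, and random weights are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k)."""
    Q = X @ W_q   # what each token is looking for
    K = X @ W_k   # what each token offers
    V = X @ W_v   # the content each token would contribute
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights               # weighted blend of values

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)             # one updated vector per token
print(weights.sum(axis=-1))  # attention weights per token
```

Row `i` of `weights` shows how much token `i` draws from each token in the sequence; row `i` of `out` is the corresponding blend of value vectors.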
A single attention operation yields one weighted average, which blurs many kinds of relationship together.[6] So transformers run several attention operations in parallel, called heads, each with its own query/key/value matrices and each free to focus on a different kind of relationship (one head might track grammar, another a long-range reference).[5][6] Their outputs are concatenated and passed through a final learned projection to form the token's representation.[4]
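A rough sketch of how several heads can run side by side, assuming the common arrangement in which the model dimension is split evenly across heads and an output projection (`W_o`, a name assumed here) mixes the concatenated results. Sizes and weights are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = X.shape
    assert d_model % n_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // n_heads

    def split(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    # Each head computes its own (seq_len, seq_len) attention pattern.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (n_heads, seq_len, d_head)
    # Concatenate the heads back into one vector per token, then mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4   # toy sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)
```

Because each head has its own slice of the projections, the four attention patterns in `weights` are free to differ from one another.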
Attention is one piece of a repeating unit. Each block pairs multi-head self-attention with a feed-forward network, a two-layer network applied independently at each position that adds non-linear processing, and wraps both in “Add & Norm”: a residual connection (adding a sub-layer's input back to its output, for smooth gradient flow) followed by layer normalization (keeping values on a stable scale).[5][9] That ordering is the original design; many modern LLMs instead apply the normalization before each sub-layer.
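The block structure can be sketched as follows. To stay short, this illustration uses a single attention head, omits the learned scale and shift in layer normalization, and assumes a ReLU between the two feed-forward layers; the parameter names and sizes are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def feed_forward(x, W1, b1, W2, b2):
    # Two layers applied to every position independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, p):
    # Sub-layer 1: attention, then Add (residual) & Norm.
    x = layer_norm(x + attention(x, p["W_q"], p["W_k"], p["W_v"]))
    # Sub-layer 2: feed-forward, then Add (residual) & Norm.
    x = layer_norm(x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]))
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32   # toy sizes
p = {
    "W_q": rng.normal(size=(d_model, d_model)),
    "W_k": rng.normal(size=(d_model, d_model)),
    "W_v": rng.normal(size=(d_model, d_model)),
    "W1": rng.normal(size=(d_model, d_ff)), "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)), "b2": np.zeros(d_model),
}
x = rng.normal(size=(seq_len, d_model))
out = transformer_block(x, p)
print(out.shape)   # same shape in and out, so blocks can be stacked
```

The output has the same shape as the input, which is what allows the block to be repeated.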
The block repeats: each layer refines the representations a little more, building from surface patterns toward abstract meaning as data flows up the stack.[10] But because attention processes all tokens at once with no recurrence, the architecture has no inherent sense of order: without position information, “dog bites man” and “man bites dog” would produce the same set of token representations.[8] The fix is to inject each token's position. The original added sine and cosine positional encodings to the embeddings; many modern models use schemes like RoPE, which instead rotates the query and key vectors inside attention according to position.[7][8]
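As an illustration of the sine-and-cosine scheme, here is a minimal sketch of the positional encoding from the original paper, where even dimensions carry a sine and odd dimensions a cosine at geometrically spaced frequencies. The function name and toy sizes are assumptions, and RoPE is not shown.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of position signals; d_model must be even."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))        # stand-in token embeddings
pe = sinusoidal_positions(6, 8)
X_with_pos = X + pe                # position is simply added to the embeddings
print(pe[:2].round(3))
```

Every position gets a distinct pattern, so the same token embedding looks different to the model depending on where it sits in the sequence.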
The original transformer had both an encoder and a decoder. Modern LLMs are decoder-only stacks that use masked (causal) self-attention: a token may attend only to itself and the tokens before it, never the ones ahead.[5][9] That mask is what lets the model be trained to predict the next token without cheating by peeking at the answer.
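One common way to implement the causal mask, sketched here under the assumption of a single head, is to set the scores for future positions to negative infinity before the softmax so that their weights come out as exactly zero. The function name and sizes are illustrative.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Q, K: (seq_len, d_k). Returns (seq_len, seq_len) masked attention weights."""
    seq_len = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # True strictly above the diagonal marks "future" tokens to hide.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax: exp(-inf) is 0, so hidden positions get zero weight.
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
w = causal_attention_weights(Q, K)
print(w.round(2))   # lower-triangular: row i has weight only on columns 0..i
```

The first token can attend only to itself, so its row places all of its weight on the first column.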
End to end, the architecture is a tall stack of identical blocks. Embeddings (with positions added) flow in; each block lets tokens attend to one another and then processes them position-by-position; after many layers, the final representations are used to predict the next token.
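The end-to-end flow can be sketched as a toy forward pass. Everything here is an assumption made for illustration: the sizes, the random untrained weights, a random position table standing in for a real positional scheme, single-head attention, and reusing the embedding matrix to turn final vectors into vocabulary scores. The blocks share one structure but each has its own weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_ff, n_layers, max_len = 50, 16, 64, 3, 5   # toy sizes

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def causal_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)       # hide later tokens
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V

def block(x, p):
    # Tokens attend to one another, then each position is processed on its own.
    x = layer_norm(x + causal_attention(x, p["W_q"], p["W_k"], p["W_v"]))
    return layer_norm(x + np.maximum(0.0, x @ p["W1"]) @ p["W2"])

def init_block():
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {"W_q": w(d_model, d_model), "W_k": w(d_model, d_model),
            "W_v": w(d_model, d_model),
            "W1": w(d_model, d_ff), "W2": w(d_ff, d_model)}

embed = rng.normal(scale=0.1, size=(vocab, d_model))
positions = rng.normal(scale=0.1, size=(max_len, d_model))   # stand-in position table
blocks = [init_block() for _ in range(n_layers)]

def forward(token_ids):
    x = embed[token_ids] + positions[: len(token_ids)]   # embeddings with positions
    for p in blocks:                                     # the stack of blocks
        x = block(x, p)
    return x @ embed.T                                   # (seq_len, vocab) scores

token_ids = np.array([3, 14, 15, 9, 26])
logits = forward(token_ids)
next_token = int(logits[-1].argmax())   # prediction read off the last position
print(logits.shape, next_token)
```

With untrained weights the predicted token is meaningless; the sketch only shows the data flow from token ids to next-token scores.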
The throughline: attention lets every token gather context from every other token, multi-head runs many such views at once, and the block-plus-stack structure turns that into deep, abstract understanding, with position bolted on and a causal mask keeping generation honest. This is the engine that consumes the vectors of Foundation 01 (it never sees letters) and that the training of Foundation 03 will shape. Its defining move, every token attending to every other, is precisely why compute grows quadratically with length, the root of the long-context expense in Foundation 06. It is also why serving needs a KV cache: each new token attends to the keys and values of every earlier token, so they are stored, and that store grows with context length into the bottleneck you met in serving. And the contextual representations it produces are, ultimately, what every Tier-2 discipline runs on.
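Attention is quadratic. Because every token attends to every other, attention compute scales with the square of the sequence length, and so does memory whenever the full score matrix is materialized. That is the reason long context is expensive and a major target of ongoing research (FlashAttention, which avoids materializing the score matrix while leaving the compute quadratic, and linear-attention variants). The KV cache adds to the cost in serving: it grows linearly with context length rather than quadratically, but at long contexts it can still dominate memory.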
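A quick piece of arithmetic makes the scaling concrete. The sketch below counts one attention score per (query token, key token) pair for a single head in a single layer; the helper name and the sequence lengths are chosen only for illustration.

```python
def attention_score_entries(seq_len, n_heads=1):
    # One score for every (query token, key token) pair, per head.
    return n_heads * seq_len * seq_len

for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_score_entries(n):>12,} scores")
```

Doubling the sequence length quadruples the number of scores: 1,000 tokens give 1,000,000 pairs, 2,000 give 4,000,000, and 4,000 give 16,000,000.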
The model has no innate sense of order. Position is added after the fact, not intrinsic to the architecture, which is why positional schemes are a live research area and why very long or unusually structured sequences can still confuse a model about what came where.
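The lack of built-in order can be checked directly. In this sketch, which assumes single-head unmasked attention with random weights and no positional information, reordering the input tokens merely reorders the outputs: each token ends up with exactly the same vector wherever it sits.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
d_model = 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

X = rng.normal(size=(3, d_model))   # stand-ins for "dog", "bites", "man"
perm = [2, 1, 0]                    # reordered as "man", "bites", "dog"

out = self_attention(X, W_q, W_k, W_v)
out_reordered = self_attention(X[perm], W_q, W_k, W_v)

# Each token's output is unchanged by the reordering; only its row moves.
same_vectors = np.allclose(out[perm], out_reordered)
print(same_vectors)
```

Adding position information to `X` before attention breaks this symmetry, which is the job the positional scheme performs.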