Tokenization & Embeddings: How Language Becomes Numbers

TL;DR — Scope

This note covers the first foundation: how raw text becomes the numbers a model actually operates on. It happens in two steps — tokenization (splitting text into units and mapping each to an integer ID) and embedding (turning each ID into a vector that places its meaning in a high-dimensional space). This is the input layer beneath everything else: the transformer in Foundation 02 never sees letters, only these vectors. It excludes what happens afterward — the architecture, training, and generation — which the later foundations cover.

The Problem

A neural network does arithmetic on numbers, not letters. So before a model can touch language, language has to become numbers — and not just any numbers, numbers that carry meaning. Two obvious schemes both fail. Give every whole word its own ID and the vocabulary explodes, while any word the model never saw simply breaks it — the out-of-vocabulary problem.^[1] Give every character its own ID and you avoid that, but sequences become very long and each unit carries almost no meaning on its own.^[3]
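Both failure modes can be seen in a few lines. This is a minimal sketch, assuming an invented one-sentence training text; the names `word_vocab`, `encode_words`, and `encode_chars` are illustrative and not from any library.

```python
# Toy illustration: the training text and vocabulary are invented for this example.
train_text = "the boy runs and the boys run"
word_vocab = {w: i for i, w in enumerate(sorted(set(train_text.split())))}

def encode_words(text, vocab):
    """Word-level: one ID per whole word; an unseen word has no ID (None)."""
    return [vocab.get(w) for w in text.split()]

def encode_chars(text):
    """Character-level: one ID per character (here, its Unicode code point)."""
    return [ord(c) for c in text]

seen = encode_words("the boy runs", word_vocab)     # every word is known
unseen = encode_words("the girl runs", word_vocab)  # "girl" was never seen
chars = encode_chars("the girl runs")               # always works, but long

print(seen)        # [5, 1, 4]
print(unseen)      # [5, None, 4]  <- the out-of-vocabulary problem
print(len(chars))  # 13 IDs for a three-word sentence
```

The `None` in the second result is the out-of-vocabulary failure; the 13 character IDs show the cost of the character-level alternative.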

And even once you have IDs, a bare integer like token #4921 tells the model nothing about what that token means or how it relates to any other. So the input layer has two jobs, and this note is about both: chop text into good units (tokenization), and represent each unit so that similar meanings land close together (embedding).

FIG. 1 — Three ways to split one word. Word-level chokes on rare words; character-level loses meaning and bloats length; subword keeps common words whole and breaks rare ones into reusable pieces — the modern default.

The Concepts

Why Subwords

Modern models split the difference with subword tokenization: keep frequent words whole, and break rare words into meaningful pieces.^[6] So boy stays one token while boys becomes boy + s — letting the model see that related words share a root, and letting it handle words it never saw in training by assembling them from parts.^[6]^[2] This is the standard across modern LLMs.^[3]
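The boy / boys example can be sketched with a greedy longest-match splitter over a tiny hand-picked vocabulary. Both the vocabulary and the matching rule are assumptions made for brevity; a trained tokenizer such as BPE applies learned merge rules instead.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Falls back to single characters, so no word is ever out of vocabulary.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece or a single character.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical subword vocabulary, chosen only for this illustration.
toy_vocab = {"boy", "s", "low", "est", "er"}

print(segment("boy", toy_vocab))      # ['boy']
print(segment("boys", toy_vocab))     # ['boy', 's']
print(segment("slowest", toy_vocab))  # ['s', 'low', 'est']
```

A word that never appeared whole, such as "slowest", is still representable because its pieces are in the vocabulary.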

Byte-Pair Encoding

The most common subword scheme is Byte-Pair Encoding, borrowed from data compression.^[4] Training the tokenizer starts with every character as a token, then repeatedly finds the most frequent adjacent pair and merges it into a new token, stopping once a target vocabulary size is reached.^[5] Early merges produce two-character tokens; later ones grow into longer subwords and whole common words.^[5] Vocabulary size is itself a tradeoff: a larger vocabulary captures more nuance but costs more compute, since the embedding table and output layer grow with it, while a smaller one is leaner but coarser.^[4]
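The training loop described above can be sketched as follows. This is a simplified version under stated assumptions: the corpus is a hypothetical word-frequency table, merges never cross word boundaries, the stopping rule is a fixed number of merges rather than a vocabulary size, and ties between equally frequent pairs are broken arbitrarily.

```python
from collections import Counter

def train_bpe(word_counts, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} corpus."""
    # Each word starts as a tuple of single-character tokens.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair, weighted by word frequency.
        pair_counts = Counter()
        for tokens, count in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one token.
        new_corpus = {}
        for tokens, count in corpus.items():
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus: word -> how often it occurs.
toy_counts = {"low": 7, "lowest": 2, "newest": 6, "widest": 3}
merges, corpus = train_bpe(toy_counts, num_merges=4)
learned = {left + right for left, right in merges}
print(merges)
print(sorted(learned))
```

On this toy corpus, four merges are enough for both "est" and "low" to emerge as single tokens, so "lowest" ends up stored as the two pieces low and est.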

FIG. 2 — BPE in action. Starting from characters, the most frequent adjacent pairs are merged step by step, growing reusable chunks like est and low. Common patterns become single tokens; the same pieces recombine for words like “lower” or “slowest.” Illustrative.

Tokens, Not Words

Turning text into IDs is encoding; turning IDs back into text is decoding.^[2] A crucial consequence follows: token count is not word count.^[4] A model's context window and its API pricing are both measured in tokens, and the tokenizer is also where a class of odd failures originates — it can split words in unintuitive ways, especially for irregular spelling or morphologically rich languages.^[4]
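A small round trip makes the token-versus-word gap concrete. This is a sketch with a hand-written split and IDs assigned in order of first appearance; both are assumptions for illustration, and a real tokenizer would produce its own split and IDs.

```python
text = "the lowest boys"

# Hand-written subword split of `text` (hypothetical, not from a real tokenizer).
tokens = ["the", " ", "low", "est", " ", "boy", "s"]

# Assign each distinct token an integer ID in order of first appearance.
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))
id_to_token = {i: tok for tok, i in vocab.items()}

def encode(token_list):
    """Encoding: tokens -> integer IDs."""
    return [vocab[tok] for tok in token_list]

def decode(ids):
    """Decoding: integer IDs -> text."""
    return "".join(id_to_token[i] for i in ids)

ids = encode(tokens)
word_count = len(text.split())
token_count = len(ids)

print(ids)                             # [0, 1, 2, 3, 1, 4, 5]
print(decode(ids) == text)             # True
print(word_count, "words,", token_count, "tokens")
```

Three words become seven tokens here, and it is the seven that would count against a context window or a bill.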

From IDs to Vectors — the Embedding Layer

A token ID is still meaningless on its own. The naive fix is one-hot encoding, a vector of zeros with a single 1 at the token's index. It carries no meaning at all: “cat” and “dog” come out perfectly orthogonal despite being related, every vector is as long as the entire vocabulary, and nothing generalizes.^[7] Instead, each ID indexes into a learned embedding matrix that maps it to a dense vector of a few hundred to a few thousand numbers (the smallest GPT-2 model used 768; modern embedding models span roughly 256 to 4096).^[7]^[10]
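The contrast can be sketched in plain Python. The four-word vocabulary and the 3-dimensional, randomly initialized matrix are assumptions for illustration; in a real model the matrix is far larger and its values are learned.

```python
import random

vocab = ["cat", "dog", "banana", "the"]  # toy vocabulary
token_id = {tok: i for i, tok in enumerate(vocab)}

def one_hot(index, size):
    """A vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

cat_1h = one_hot(token_id["cat"], len(vocab))
dog_1h = one_hot(token_id["dog"], len(vocab))
print(dot(cat_1h, dog_1h))  # 0.0: orthogonal, no notion of relatedness

# Embedding matrix: one row per token, `embed_dim` columns.
# Random here; training is what would give the rows their meaning.
random.seed(0)
embed_dim = 3
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embed_dim)] for _ in vocab
]

def embed(tok):
    """Embedding lookup: the token's ID selects a row of the matrix."""
    return embedding_matrix[token_id[tok]]

# Multiplying a one-hot vector by the matrix selects the same row,
# which is why the lookup can replace the multiplication.
via_matmul = [dot(cat_1h, column) for column in zip(*embedding_matrix)]
print(via_matmul == embed("cat"))  # True
```

The one-hot vector grows with the vocabulary, while the embedded vector stays at `embed_dim` numbers no matter how many tokens exist.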

FIG. 3 — One-hot vs dense embedding. One-hot is a sparse identity tag the size of the whole vocabulary, with “cat” and “dog” as unrelated as any two words. A learned embedding is compact and carries meaning: related words get nearby vectors.

The Semantic Space

Those vectors are learned during training: as the model practices predicting words in context, it nudges the vectors so that words appearing in similar contexts end up near each other.^[10]^[8] The result is a space where geometry encodes meaning: “Russia” lands near “Moscow,” and “cat” lands near “feline” but far from “banana.” Similarity is usually measured by the cosine of the angle between two vectors (cosine similarity).^[8]^[9] One caution: the individual dimensions are not human-interpretable; they are learned without labels, so meaning lives in the relative positions, not in any single number.^[9]
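Cosine similarity itself is a short computation. The sketch below uses hand-picked 3-dimensional vectors, invented so that "cat" and "feline" point in similar directions; real embeddings are learned and have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors for illustration only.
vectors = {
    "cat":    [0.9, 0.8, 0.1],
    "feline": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

cat_feline = cosine_similarity(vectors["cat"], vectors["feline"])
cat_banana = cosine_similarity(vectors["cat"], vectors["banana"])
print(round(cat_feline, 3), round(cat_banana, 3))
```

Because the measure depends only on direction, scaling a vector by a positive constant leaves its similarity to every other vector unchanged.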

FIG. 4 — The semantic space. Related words cluster; unrelated ones sit far apart; and relationships show up as consistent directions (man→king runs parallel to woman→queen). Meaning is in the geometry, not in any one coordinate. Illustrative 2-D projection.
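The "consistent directions" idea can be sketched with hand-placed 2-D points. The coordinates are invented so that the two offsets match exactly; with real learned embeddings the match is only approximate.

```python
# Hand-placed 2-D points, invented for illustration.
points = {
    "man":    (1.0, 1.0),
    "king":   (1.0, 3.0),
    "woman":  (3.0, 1.0),
    "queen":  (3.0, 3.0),
    "banana": (-2.0, 0.5),
}

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def add(u, v):
    return tuple(a + b for a, b in zip(u, v))

def nearest(target, exclude):
    """Name of the point closest to `target`, skipping names in `exclude`."""
    def sq_dist(name):
        return sum((a - b) ** 2 for a, b in zip(points[name], target))
    return min((n for n in points if n not in exclude), key=sq_dist)

man_to_king = sub(points["king"], points["man"])
woman_to_queen = sub(points["queen"], points["woman"])
print(man_to_king == woman_to_queen)  # True: the two arrows are parallel

# Apply the man -> king direction starting from "woman".
target = add(points["woman"], man_to_king)
print(nearest(target, exclude={"woman"}))  # queen
```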

Static vs Contextual

Early embeddings (Word2Vec, GloVe) gave each word a single fixed vector. Inside a transformer the representation becomes contextual — the same token gets a different vector depending on the words around it, so “bank” by a river and “bank” with money diverge.^[10] That contextualizing is the job of the next foundation: the architecture itself.

How It All Fits Together

The whole input layer is a short, fixed pipeline. Text is split into tokens, each token is looked up as an integer ID, and each ID is mapped through the embedding matrix into a vector. Only the vectors go forward.

Text: “lowest”
→ Step 1, Tokenize: low · est
→ Step 2, Token IDs: 2606 · 4905
→ Step 3, Embed: lookup → vectors
→ Out, Vectors: into the transformer

FIG. 5 — The input pipeline. Tokenization decides the units; the embedding lookup gives each unit a meaningful vector. Everything downstream operates on these vectors, never on the original letters.
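The three steps of the pipeline can be sketched end to end. The token IDs are the ones shown in the figure; the two-entry vocabulary, the greedy matching rule, and the 4-dimensional vectors are assumptions made up for illustration.

```python
# IDs taken from the figure; the rest of the vocabulary is omitted.
TOKEN_TO_ID = {"low": 2606, "est": 4905}

# Made-up 4-D vectors standing in for rows of a learned embedding matrix.
EMBEDDINGS = {
    2606: [0.12, -0.40, 0.33, 0.05],
    4905: [-0.27, 0.18, 0.09, 0.61],
}

def tokenize(text):
    """Step 1: split text into known tokens (greedy longest match)."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in TOKEN_TO_ID:
                tokens.append(text[start:end])
                start = end
                break
        else:
            raise ValueError(f"no token covers {text[start]!r}")
    return tokens

def to_ids(tokens):
    """Step 2: look up each token's integer ID."""
    return [TOKEN_TO_ID[tok] for tok in tokens]

def embed(ids):
    """Step 3: map each ID to its vector."""
    return [EMBEDDINGS[i] for i in ids]

tokens = tokenize("lowest")
ids = to_ids(tokens)
vectors = embed(ids)

print(tokens)   # ['low', 'est']
print(ids)      # [2606, 4905]
print(vectors)  # two 4-D vectors, the only thing passed downstream
```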

The throughline: tokenization decides what the model perceives as a unit, and embedding decides how much meaning that unit arrives with. This layer sits beneath the entire stack — the transformer of Foundation 02 never sees letters, only these vectors. And the same idea echoes upward through the disciplines: the vector search at the heart of retrieval is similarity measured in exactly this kind of space, and the token is the unit that inference and serving counts, caches, and bills. It even explains familiar model failures — miscounting the letters in a word, fumbling digit-by-digit arithmetic, or costing more to process some languages than others — because the model sees chunks, not characters.

① The tokenizer is fixed, invisible, and shapes everything. It is chosen before training and never changes, yet it silently sets context cost, multilingual fairness, and a whole class of failures (letter-counting, spelling, arithmetic). When a model does something baffling with characters, suspect the tokenizer first, and remember that token ≠ word.

② Don't over-read the dimensions. An embedding's individual numbers are not labeled features; meaning is in the relative geometry. The useful question is never “what does dimension 412 mean,” but “what sits near this vector, and in which direction.”
