This note covers the first foundation: how raw text becomes the numbers a model actually operates on. It happens in two steps: tokenization (splitting text into units and mapping each to an integer ID) and embedding (turning each ID into a vector that places its meaning in a high-dimensional space). This is the input layer beneath everything else: the transformer in Foundation 02 never sees letters, only these vectors. The note does not cover what happens afterward (the architecture, training, and generation), which the later foundations address.
A neural network does arithmetic on numbers, not letters. So before a model can touch language, language has to become numbers — and not just any numbers, numbers that carry meaning. Two obvious schemes both fail. Give every whole word its own ID and the vocabulary explodes, while any word the model never saw simply breaks it — the out-of-vocabulary problem.[1] Give every character its own ID and you avoid that, but sequences become very long and each unit carries almost no meaning on its own.[3]
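The two failure modes can be seen in a few lines. This is a minimal sketch on made-up data: the training sentence, the new sentence, and the use of -1 as an out-of-vocabulary marker are all assumptions chosen for illustration.

```python
# Toy comparison of word-level and character-level IDs (hypothetical data).
train_text = "the cat sat on the mat"
new_text = "the cats sat"

# Word-level: one ID per whole word seen in training.
word_vocab = {w: i for i, w in enumerate(sorted(set(train_text.split())))}
# -1 is an illustrative marker for "never seen in training".
word_ids = [word_vocab.get(w, -1) for w in new_text.split()]

# Character-level: one ID per character seen in training.
char_vocab = {c: i for i, c in enumerate(sorted(set(train_text)))}
char_ids = [char_vocab[c] for c in new_text]

print(word_ids)       # "cats" was never seen, so it gets the marker
print(len(char_ids))  # one ID per character, spaces included
```

The word-level scheme has no entry for "cats" even though "cat" is in the vocabulary, while the character-level scheme covers everything but needs four times as many IDs as there are words in this sentence.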
And even once you have IDs, a bare integer like token #4921 tells the model nothing about what that token means or how it relates to any other. So the input layer has two jobs, and this note is about both: chop text into good units (tokenization), and represent each unit so that similar meanings land close together (embedding).
Modern models split the difference with subword tokenization: keep frequent words whole, and break rare words into meaningful pieces.[6] So “boy” stays one token while “boys” becomes “boy” + “s”, which lets the model see that related words share a root and handle words it never saw in training by assembling them from parts.[6][2] This is the standard across modern LLMs.[3]
The most common subword scheme is Byte-Pair Encoding, borrowed from data compression.[4] Training the tokenizer starts with every character as a token, then finds the most frequent adjacent pair and merges it into a new token, repeating this step until a target vocabulary size is reached.[5] Early merges produce two-character tokens; later ones grow into longer subwords and whole common words.[5] Vocabulary size is itself a tradeoff: a larger vocabulary captures more nuance but costs more compute, while a smaller one is leaner but coarser.[4]
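The merge loop described above can be sketched as follows. This is a simplified toy version, not a production tokenizer: the function name `train_bpe`, the word list, and the tie-breaking rule (first pair encountered wins) are assumptions, and real implementations also handle bytes, word boundaries, and pre-tokenization.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy BPE)."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the pair encountered first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus: word counts chosen only for illustration.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, corpus = train_bpe(words, num_merges=4)
print(merges)
print(list(corpus))
```

On this corpus the first merges are two-character pairs, and after four merges both "est" and "low" exist as single symbols, which matches the progression from short pairs to longer subwords and whole words.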
[Figure, illustrative: … est and low. Common patterns become single tokens; the same pieces recombine for words like “lower” or “slowest.”]
Turning text into IDs is encoding; turning IDs back into text is decoding.[2] A crucial consequence follows: token count is not word count.[4] A model's context window and its API pricing are both measured in tokens. The tokenizer is also where a class of odd failures originates: it can split words in unintuitive ways, especially for irregular spelling or morphologically rich languages.[4]
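A small round trip makes the encode/decode distinction and the token-versus-word gap concrete. This sketch uses greedy longest-match over a hand-written vocabulary, which is a simplification and not how BPE applies its learned merges; the vocabulary and the sample text are assumptions.

```python
# Hypothetical toy vocabulary: a few subwords plus single characters as fallback.
vocab = ["low", "est", "er", "slow", " ", "l", "o", "w", "e", "s", "t", "r", "n", "i", "d"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

def encode(text):
    """Greedy longest-match: at each position take the longest vocabulary entry."""
    ids, i = [], 0
    max_len = max(len(tok) for tok in vocab)
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in token_to_id:
                ids.append(token_to_id[piece])
                i += length
                break
        else:
            # No vocabulary entry starts here, not even a single character.
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

def decode(ids):
    """Map IDs back to their token strings and concatenate them."""
    return "".join(id_to_token[i] for i in ids)

text = "slowest lower widest"
ids = encode(text)
print([id_to_token[i] for i in ids])
print(len(text.split()), "words ->", len(ids), "tokens")
assert decode(ids) == text  # encoding then decoding returns the original text
```

Here "slowest" costs two tokens, while "widest" falls back to single characters for "wid" and costs four, so three words become ten tokens. The same effect, at scale, is why cost and context usage vary with spelling and language.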
A token ID is still meaningless on its own. The naive fix, one-hot encoding — a vector of zeros with a single 1 at the token's index — carries no meaning at all: “cat” and “dog” come out perfectly orthogonal despite being related, every vector is as long as the entire vocabulary, and nothing generalizes.[7] Instead, each ID indexes into a learned embedding matrix that maps it to a dense vector — a few hundred to a few thousand numbers (GPT-2 used 768; modern embedding models span roughly 256 to 4096).[7][10]
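The contrast between one-hot vectors and an embedding lookup can be sketched in plain Python. The three-word vocabulary, the dimension of 4, and the random matrix values are assumptions for illustration; in a real model the matrix is learned and far larger.

```python
import random

vocab = ["cat", "dog", "banana"]
vocab_size, dim = len(vocab), 4  # real models use far larger values

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

cat, dog = one_hot(0, vocab_size), one_hot(1, vocab_size)
print(dot(cat, dog))  # 0.0: distinct one-hot vectors are always orthogonal

# Embedding matrix: one row of `dim` numbers per token ID.
# Random stand-in values here; a real model learns them during training.
rng = random.Random(0)
embedding_matrix = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

token_id = vocab.index("dog")
dog_vector = embedding_matrix[token_id]  # the lookup is just row indexing

# Same result as multiplying the one-hot vector by the matrix, without the wasted work.
via_matmul = [dot(one_hot(token_id, vocab_size), column) for column in zip(*embedding_matrix)]
print(via_matmul == dog_vector)
```

Indexing a row is equivalent to multiplying the one-hot vector by the matrix, which is one way to see the embedding layer as the dense replacement for one-hot encoding rather than a separate idea.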
Those vectors are learned during training: as the model practices predicting words in context, it nudges the vectors so that words appearing in similar contexts end up near each other.[10][8] The result is a space where geometry encodes meaning — “Russia” lands near “Moscow,” “cat” near “feline,” far from “banana” — and similarity is measured by the angle between vectors (cosine similarity).[8][9] One caution: the individual dimensions are not human-interpretable; they are learned without labels, so meaning lives in the relative positions, not in any single number.[9]
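Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies it to hand-made three-dimensional vectors; these values are invented for illustration and are not taken from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical hand-made vectors; real embeddings are learned and much larger.
vectors = {
    "cat":    [0.9, 0.8, 0.1],
    "feline": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_cat_feline = cosine_similarity(vectors["cat"], vectors["feline"])
sim_cat_banana = cosine_similarity(vectors["cat"], vectors["banana"])
print(round(sim_cat_feline, 3), round(sim_cat_banana, 3))
```

Because the measure depends only on direction, scaling a vector does not change its similarity to others; what matters is where it points relative to its neighbors.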
Early embeddings (Word2Vec, GloVe) gave each word a single fixed vector. Inside a transformer the representation becomes contextual — the same token gets a different vector depending on the words around it, so “bank” by a river and “bank” with money diverge.[10] That contextualizing is the job of the next foundation: the architecture itself.
The whole input layer is a short, fixed pipeline. Text is split into tokens, each token is looked up as an integer ID, and each ID is mapped through the embedding matrix into a vector. Only the vectors go forward.
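The three steps can be put together in one short sketch. Everything here is a stand-in: the four-entry vocabulary, the greedy longest-match `tokenize` function (a simplification, not BPE), and the random embedding values are assumptions made so the example is self-contained.

```python
import random

# Hypothetical toy vocabulary and embedding matrix (random stand-ins for learned values).
vocab = ["boy", "s", " ", "run"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
dim = 4
rng = random.Random(0)
embedding_matrix = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in vocab]

def tokenize(text):
    """Step 1: split text into tokens (greedy longest match, a simplification)."""
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers {text[i]!r}")
        tokens.append(match)
        i += len(match)
    return tokens

def embed(text):
    tokens = tokenize(text)                       # text -> tokens
    ids = [token_to_id[t] for t in tokens]        # tokens -> integer IDs
    vectors = [embedding_matrix[i] for i in ids]  # IDs -> vectors
    return tokens, ids, vectors

tokens, ids, vectors = embed("boys run")
print(tokens)
print(ids)
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Only `vectors` would be passed on to the rest of the model; the token strings and IDs are intermediate bookkeeping.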
The throughline: tokenization decides what the model perceives as a unit, and embedding decides how much meaning that unit arrives with. This layer sits beneath the entire stack — the transformer of Foundation 02 never sees letters, only these vectors. And the same idea echoes upward through the disciplines: the vector search at the heart of retrieval is similarity measured in exactly this kind of space, and the token is the unit that inference and serving counts, caches, and bills. It even explains familiar model failures — miscounting the letters in a word, fumbling digit-by-digit arithmetic, or costing more to process some languages than others — because the model sees chunks, not characters.
The tokenizer is fixed, invisible, and shapes everything. It is chosen before training and never changes, yet it silently sets context cost, multilingual fairness, and a whole class of failures (letter-counting, spelling, arithmetic). When a model does something baffling with characters, suspect the tokenizer first — and remember that token ≠ word.
Don't over-read the dimensions. An embedding's individual numbers are not labeled features; meaning is in the relative geometry. The useful question is never “what does dimension 412 mean,” but “what sits near this vector, and in which direction.”