Tokenization converts raw text into sequences of integers that neural networks can
process. The challenge is finding the right
granularity: character-level is simple but produces long sequences;
BPE (Byte Pair Encoding) compresses common patterns into single tokens.
Every LLM starts here
— the model never sees raw text, only token IDs.
Character-Level
One token per character. Simple but ~4× longer sequences.
How it works: Map each unique character to an integer. "hello" becomes [7, 4,
11, 11, 14] — five tokens for five characters.
# From Karpathy's microgpt — 5 lines of tokenizer
uchars = sorted(set(''.join(docs))) # unique chars
BOS = len(uchars) # special token
vocab_size = len(uchars) + 1
# Encode: string → list of integers
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
Tradeoff: Vocab is tiny (~100 tokens) but sequences are long. A 1000-character
text = 1000 tokens. Attention cost scales with sequence length squared.
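The snippet above can be rounded out into a runnable encode/decode pair. A minimal sketch, assuming a toy `docs` corpus; the `stoi` dict and `decode` helper are additions for illustration, not part of the original:

```python
docs = ["hello", "world"]                      # hypothetical tiny corpus
uchars = sorted(set(''.join(docs)))            # unique chars, sorted for determinism
BOS = len(uchars)                              # one extra id as the boundary token
stoi = {ch: i for i, ch in enumerate(uchars)}  # char → id lookup

def encode(doc):
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    return ''.join(uchars[t] for t in tokens if t != BOS)

assert decode(encode("hello")) == "hello"      # round-trip is lossless
```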
BPE Algorithm
Greedily merge most frequent adjacent pairs. Statistical compression.
The algorithm:
1. Start with raw bytes (256 base tokens)
2. Count all adjacent token pairs
3. Merge the most frequent pair → new token
4. Repeat for N merges (N = vocab_size - 256)
# Core BPE loop (from no-magic/microtokenizer)
def train_bpe(ids, num_merges):
    merges = []
    for i in range(num_merges):
        counts = get_pair_counts(ids)
        pair = max(counts, key=counts.get)
        new_id = 256 + i
        ids = apply_merge(ids, pair, new_id)
        merges.append((pair, new_id))
    return merges
Key insight: BPE discovers structure without linguistic rules. Common words
become single tokens; rare words split into reusable pieces. "unhappy" might become
["un", "happy"] — but this is frequency-driven, not morphological analysis.
Vocab Tradeoffs
Small vocab = longer sequences. Large vocab = bigger embedding matrix.
| Tokenizer | Vocab | "Hello world" |
|---|---|---|
| Character | ~100 | 11 tokens |
| GPT-2 BPE | 50,257 | 2 tokens |
| GPT-4 | 100,256 | 2 tokens |
| GPT-4o | 199,997 | 2 tokens |
| LLaMA 3 | 128,256 | 2 tokens |
Sweet spot: 50K–200K tokens. Balances sequence length against embedding matrix
size. Byte-level fallback guarantees open vocabulary — no "unknown token" failures.
Special Tokens
BOS, EOS, PAD — boundary markers for sequences.
BOS (Beginning of Sequence): Signals start of generation
EOS (End of Sequence): Model learns when to stop
PAD: Fills shorter sequences in batches (masked in loss)
# microgpt: BOS wraps each document
tokens = [BOS] + encode("emma") + [BOS]
# Model learns: BOS → start generating, BOS → stop
What tokenization reveals about LLMs
LLMs never see characters — they see token IDs. "strawberry" might tokenize as ["str",
"aw", "berry"]. The model has no direct access to individual letters, which
explains why LLMs struggle with counting letters, spelling tasks, and character-level
operations. Tokenization is linguistically blind — subword boundaries rarely align with
morpheme boundaries.
Embeddings map token IDs to dense vectors — a
lookup table that's mathematically equivalent to one-hot × weight matrix, but
efficient. Token embeddings capture
distributional similarity — tokens appearing in similar
contexts get similar vectors. Position embeddings tell the model where each token
is in the sequence.
Token Embedding
Just indexing into a matrix. Row N = embedding for token N.
embedding = W_emb[token_id] ≡ one_hot(token_id) @ W_emb
# Karpathy's microgpt: embedding is a 2D list
wte = [[random.gauss(0, 0.08) for _ in range(n_embd)]
       for _ in range(vocab_size)]
tok_emb = wte[token_id]  # that's it — just indexing
Key insight: Each row is a learnable d-dimensional vector. Initialized
randomly, then updated via backpropagation to capture co-occurrence patterns from
training data.
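The one-hot equivalence stated above is easy to verify directly. A toy sketch with made-up dimensions (`vec_mat` is an illustrative helper, not from the original):

```python
import random

vocab_size, n_embd = 5, 4  # toy dimensions
wte = [[random.gauss(0, 0.08) for _ in range(n_embd)]
       for _ in range(vocab_size)]

def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]

def vec_mat(v, m):
    # (1×V) row vector times (V×d) matrix → (1×d)
    return [sum(v[i] * m[i][j] for i in range(len(v)))
            for j in range(len(m[0]))]

# Multiplying by a one-hot row selects exactly one row of the matrix:
assert vec_mat(one_hot(3, vocab_size), wte) == wte[3]
```

Indexing is the same operation without the V−1 wasted multiplications, which is why embedding layers are implemented as lookups.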
Position Embedding
Tells the model WHERE in the sequence. Learned (GPT-2) or rotational (RoPE).
| Method | How | Used By |
|---|---|---|
| Learned | Lookup table, same as token embedding | GPT-2 |
| Sinusoidal | Fixed sin/cos at different frequencies | Original Transformer |
| RoPE | Rotate Q/K vectors by position-dependent angle | LLaMA, Qwen, Gemma |
# microgpt: learned position embeddings
wpe = [[random.gauss(0, 0.08) for _ in range(n_embd)]
       for _ in range(block_size)]
pos_emb = wpe[pos_id]
Combined Input
Sum token + position. Model learns to disentangle.
x = token_embedding + position_embedding
# The model's actual input for each token
x = [t + p for t, p in zip(tok_emb, pos_emb)]
This addition is not arbitrary — both embeddings live in the same vector space, and
the model learns during training to disentangle "what token is this" from "where is
it."
Weight Tying
Same matrix for input embedding and output projection. ~30% fewer params.
In GPT-2, the input embedding matrix and the output "lm_head" are
the same matrix (transposed). Tokens that are similar in meaning are close in
embedding space, so they naturally get similar output probabilities.
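The tied head can be sketched directly: logits are dot products of the hidden state against every row of the same `wte` used for input lookup (toy numbers, pure Python):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lm_head(x, wte):
    # Tied output projection: logits[i] = x · wte[i], i.e. x @ wte.T
    return [dot(x, row) for row in wte]

wte = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3-token vocab, d = 2
x = [1.0, 0.0]                              # some hidden state
logits = lm_head(x, wte)                    # one logit per vocab token
assert logits == [0.1, 0.3, 0.5]
```

No second matrix is ever allocated, which is where the parameter saving comes from.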
| Model | Embed Dim | Vocab | Embed Params |
|---|---|---|---|
| GPT-2 Small | 768 | 50,257 | 38.6M |
| LLaMA 7B | 4,096 | 32,000 | 131M |
| microgpt | 16 | 27 | 688 |
What embeddings encode
Embeddings don't encode "meaning" — they encode
distributional similarity. "king" is near "queen" because they appear in similar
contexts in the training data, not because the model understands monarchy. The famous
"king - man + woman ≈ queen" works because gender is a consistent direction in the
training distribution.
Attention lets each position "look at" all other positions and decide which matter most.
Think of it as a
soft dictionary lookup: Query asks "what am I looking for?",
Key answers "what do I contain?", Value provides "what to return if matched." The result
is a weighted blend of all values — not a hard lookup.
Self-Attention
softmax(QK^T / √d) @ V — the core operation.
Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V
Step by step:
1. Q·Kᵀ — Similarity scores: every position vs every other
2. / √dₖ — Scale to prevent softmax saturation
3. softmax — Convert to probabilities (sum to 1 per row)
4. · V — Weighted blend of value vectors
# From no-magic/microattention.py
def attention(q, k, v):
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    scores = [[s * scale for s in row] for row in scores]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)
Why √dₖ scaling? Without it, the variance of the dot products grows linearly with dimension (variance ≈ dₖ). Large values push softmax toward one-hot outputs, killing gradients.
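The growth is easy to measure. A sketch that estimates the variance of q·k for random unit-variance vectors (the estimate is noisy, but comes out near d):

```python
import random

def dot_variance(d, trials=2000):
    # Empirical variance of q·k where q, k have i.i.d. N(0, 1) entries
    samples = []
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d)]
        k = [random.gauss(0, 1) for _ in range(d)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

# Raw dot products: variance ≈ d. After dividing scores by √d: variance ≈ 1.
```

With d = 64 the measured variance lands near 64, so unscaled logits would routinely reach magnitudes where softmax saturates.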
Query, Key, Value
Three learned projections of the same input. Different roles.
| Matrix | Role | Analogy |
|---|---|---|
| Query (Q) | What am I looking for? | Search query |
| Key (K) | What do I contain? | Document title |
| Value (V) | What info do I provide? | Document content |
Q = x @ W_q # learned projection
K = x @ W_k
V = x @ W_v
The separation allows the model to learn
different representations for "what to
look for" vs "what to match against" vs "what to return."
Causal Mask
Future positions get −∞ before softmax → zero weight. Can't see the future.
For autoregressive generation, position t can only attend to positions 0..t (itself and earlier). We set future scores to −∞ before softmax, so exp(−∞) = 0.
# Upper triangle = future positions
mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
scores.masked_fill_(mask, float('-inf'))
weights = torch.softmax(scores, dim=-1)  # exp(-inf) → 0
# Result: triangular attention pattern
# pos 0: sees [self]
# pos 1: sees [pos 0, self]
# pos n: sees [pos 0, ..., self]
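The same mask works without a framework. A dependency-free sketch in the style of the document's other snippets, fusing the mask into a per-row softmax:

```python
import math

def causal_attention_weights(scores):
    # scores: n×n raw attention scores; mask the upper triangle (the future)
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float('-inf') for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) = 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

w = causal_attention_weights([[0.0, 9.0], [1.0, 1.0]])
assert w[0] == [1.0, 0.0]   # position 0 cannot see position 1
assert w[1] == [0.5, 0.5]   # position 1 attends to both equally
```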
Multi-Head Attention
Multiple attention patterns in parallel. Same FLOPs, richer representation.
Instead of one d-dimensional attention, use h heads each with d/h dimensions. Each
head can specialize — one learns syntax, another semantics, another position.
Same total FLOPs as single-head.
# Karpathy's microgpt: multi-head via slicing
for h in range(n_head):
    hs = h * head_dim
    q_h = q[hs:hs+head_dim]
    k_h = [ki[hs:hs+head_dim] for ki in keys]
    v_h = [vi[hs:hs+head_dim] for vi in values]
    # attention per head, concatenate all heads
The W_O projection after concatenation is where cross-head mixing happens — combining signals from all attention patterns.
| Variant | KV Heads | Memory | Used By |
|---|---|---|---|
| MHA | = Q heads | Highest | GPT-2 |
| GQA | 1 per Q group | Middle | LLaMA 2/3 |
| MQA | 1 | Lowest | Falcon |
The Feed-Forward Network processes each position
independently: expand → activate → contract. Often called the
"memory" of the transformer — factual knowledge is stored in
these weights. Residual connections (x + f(x)) create a gradient highway that makes
deep networks trainable.
FFN Structure
Two linear layers with 4× expansion and activation in between.
FFN(x) = W₂ · activation(W₁ · x)
# Karpathy's microgpt: MLP block
x = linear(x, mlp_fc1) # [d] → [4d] expand
x = [xi.relu() for xi in x] # activate
x = linear(x, mlp_fc2) # [4d] → [d] contract
Why 4× expansion? More capacity to compute complex functions without widening
the residual stream. The model temporarily expands to a higher-dimensional space,
processes, then compresses back.
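The expand → activate → contract shape can be run end to end with toy weights (a hypothetical 1-dimensional input with four hidden units, preserving the 4× ratio):

```python
def linear(x, w):
    # y[j] = Σᵢ w[j][i] · x[i]; len(w) output features
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def ffn(x, w1, w2):
    h = linear(x, w1)                # [d] → [4d] expand
    h = [max(0.0, hi) for hi in h]   # ReLU activate
    return linear(h, w2)             # [4d] → [d] contract

w1 = [[1.0], [-1.0], [2.0], [0.5]]  # d=1 → 4 hidden units
w2 = [[1.0, 1.0, 1.0, 1.0]]         # 4 hidden units → d=1
assert ffn([2.0], w1, w2) == [7.0]  # relu([2, -2, 4, 1]) sums to 7
```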
Activations
ReLU → GELU → SwiGLU: evolution of nonlinearity.
| Activation | Formula | Used By |
|---|---|---|
| ReLU | max(0, x) | Educational, older models |
| GELU | x · Φ(x) ≈ smooth ReLU | GPT-2, GPT-3, BERT |
| SiLU/Swish | x · sigmoid(x) | Newer models |
| SwiGLU | (xW₁ ⊙ Swish(xV))W₂ | LLaMA 2/3, PaLM |
All serve the same purpose: introduce
nonlinearity so the network can
approximate arbitrary functions. Without activation, stacking linear layers collapses
to a single linear transformation.
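The first three rows of the table are one-liners. Sketches using the exact GELU via the Gaussian CDF (SwiGLU is omitted because it needs the extra gating weight matrices):

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x · Φ(x), with the Gaussian CDF written via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # SiLU / Swish: x · sigmoid(x)
    return x / (1.0 + math.exp(-x))

# All three pass large positive x through nearly unchanged and squash
# large negative x toward 0; they differ in smoothness around 0.
```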
Residual Connections
x + f(x) — gradient highway. Makes deep networks trainable.
x = x + attention(norm(x))
x = x + ffn(norm(x))
Without residuals, gradients must pass through every layer's transformations — they
vanish in deep networks. The residual "highway" lets gradients flow
directly backward unimpeded.
# Both attention and FFN are wrapped in residuals
x_residual = x
x = rmsnorm(x)
x = attention(x)
x = [a + b for a, b in zip(x, x_residual)] # add skip
Pre-Norm vs Post-Norm
Normalize BEFORE transformation (modern). Cleaner residual path.
Post-Norm (2017 Transformer): x = Norm(x + f(x))
Pre-Norm (GPT-2+): x = x + f(Norm(x))
Pre-norm keeps the residual path clean — no normalization layer on the skip
connection. More stable training. All modern LLMs use pre-norm, typically
RMSNorm instead of LayerNorm.
# RMSNorm (simpler than LayerNorm — no mean subtraction)
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
Attention vs FFN: Division of labor
Attention routes information between tokens ("who talks to whom").
FFN processes information within each token ("what to say"). Interpretability
research shows factual knowledge ("Paris is the capital of France") is stored in FFN
weights, while attention handles the contextual routing. Together, one transformer block
combines context-dependent retrieval with position-independent computation.
Training is simple:
predict next token → compute loss → backpropagate → update weights. Cross-entropy
loss measures "surprise" at the correct answer. Adam combines momentum with adaptive
learning rates. This loop, repeated billions of times with different data, compresses the
statistical patterns of language into model weights.
Softmax
Converts logits to probabilities that sum to 1.
softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ)
# Numerically stable softmax (subtract max first)
def softmax(logits):
    max_val = max(v.data for v in logits)
    exps = [(v - max_val).exp() for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
Subtract max for stability: Without this, large logits cause exp() to overflow
to infinity. The subtraction doesn't change the output probabilities.
Cross-Entropy Loss
−log(p[correct]). High penalty for confident wrong answers.
loss = −log( P(correct token) )
If probability is 0.9 → loss = 0.105 (small, correct)
If probability is 0.01 → loss = 4.6 (large, wrong)
The log makes the penalty grow
increasingly severe as confidence in the wrong answer increases. This is the
only training signal — predict the next token better.
# The loss for one token prediction
logits = gpt(token_id, pos_id, keys, values)
probs = softmax(logits)
loss = -probs[target_id].log()
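The two loss values quoted above check out numerically:

```python
import math

def cross_entropy(p_correct):
    # Loss for a single prediction: −log of probability on the right answer
    return -math.log(p_correct)

assert round(cross_entropy(0.9), 3) == 0.105   # confident and right: small
assert round(cross_entropy(0.01), 1) == 4.6    # confident and wrong: large
```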
Backpropagation
Chain rule from loss to every parameter. Each param gets a gradient.
Starting from the loss, work backward through every operation: for each parameter,
compute "if I nudge this value, how much does the loss change?"
# Karpathy's autograd backward pass
def backward(self):
    topo = []  # topological sort of computation graph
    build_topo(self)
    self.grad = 1  # seed: dloss/dloss = 1
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad  # chain rule
Six primitive operations with known derivatives — add, mul, pow, exp, log, relu — compose to compute gradients through any neural network.
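The whole mechanism fits in a small class. A simplified sketch with only add and mul (the full version carries all six primitives):

```python
class Value:
    """Minimal scalar autograd node: tracks children and local gradients."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a·b)/da = b, d(a·b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        topo, seen = [], set()
        def build(v):
            if id(v) not in seen:
                seen.add(id(v))
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0  # seed: dloss/dloss = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad  # chain rule

a, b = Value(2.0), Value(3.0)
loss = a * b + a          # loss = ab + a = 8
loss.backward()
assert (a.grad, b.grad) == (4.0, 2.0)  # dloss/da = b + 1, dloss/db = a
```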
Adam Optimizer
Momentum + adaptive rates. Bias correction for early steps.
m = β₁·m + (1−β₁)·∇L      (momentum)
v = β₂·v + (1−β₂)·(∇L)²   (variance)
θ = θ − lr · m̂ / (√v̂ + ε)  (update)
# Complete Adam from Karpathy's microgpt
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
    m_hat = m[i] / (1 - beta1 ** (step + 1))  # bias correction
    v_hat = v[i] / (1 - beta2 ** (step + 1))
    p.data -= lr * m_hat / (v_hat ** 0.5 + eps)
Momentum averages past gradients → consistent direction.
Adaptive scaling divides by the running RMS of each parameter's gradients → parameters with large gradients take small steps, those with small gradients take big steps.
Bias correction fixes the cold-start problem when m and v are initialized at 0.
The Training Loop
Forward → loss → backward → update. Repeat billions of times.
# The ENTIRE learning algorithm
for step in range(num_steps):
    # 1. Forward: compute predictions
    logits = model(input_tokens)
    # 2. Loss: measure wrongness
    loss = cross_entropy(logits, target_tokens)
    # 3. Backward: compute gradients
    loss.backward()
    # 4. Update: nudge weights toward less loss
    optimizer.step()
    optimizer.zero_grad()
This 4-line loop, repeated with different data, is how
every LLM learns. GPT-4 and microgpt use the same algorithm — only scale
differs.
Learning Rate Schedule
Warmup then decay. Start cautious, push hard, then refine.
Warmup: Start small because early gradients are noisy. Ramp up over first ~1%
of training.
Cosine decay: Gradually reduce to near zero. As you approach the minimum, large
steps overshoot.
# Karpathy's microgpt uses simple linear decay
lr_t = learning_rate * (1 - step / num_steps)

# Production models use warmup + cosine decay
if step < warmup_steps:
    lr = peak_lr * step / warmup_steps
else:
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr = peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
What training reveals
The training objective is next-token prediction — a purely statistical task. By
predicting what comes next in trillions of tokens, the model is forced to learn grammar
(to predict syntax), facts (to predict entities), and patterns that resemble reasoning (to
predict logical conclusions). All of it compressed into weight matrices through gradient
descent.
Educational sources and inspiration:
- microgpt.py by Andrej Karpathy — "This file is the complete algorithm. Everything else is just efficiency."
- no-magic by Mathews-Tom — 41 algorithms from scratch, zero dependencies
- LLMs-from-scratch by Sebastian Raschka — Build a Large Language Model (Manning, 2024)
- Stanford CS336 — Language Modeling from Scratch (Hashimoto & Liang)
- Stanford CME 295 — Transformers & Large Language Models cheatsheet (Amidi & Amidi)