Understanding LLMs: The Core Algorithm

1. Tokenize: text → integers
2. Embed: integers → vectors
3. Attend: vectors → context
4. Transform: context → features
5. Predict: features → next token
Tokenization converts raw text into sequences of integers that neural networks can process. The challenge is finding the right granularity: character-level is simple but produces long sequences; BPE (Byte Pair Encoding) compresses common patterns into single tokens. Every LLM starts here — the model never sees raw text, only token IDs.
Character-Level
One token per character. Simple but ~4× longer sequences.
How it works: Map each unique character to an integer. "hello" becomes [7, 4, 11, 11, 14] — five tokens for five characters.

# From Karpathy's microgpt — 5 lines of tokenizer
uchars = sorted(set(''.join(docs)))  # unique chars
BOS = len(uchars)                    # special token
vocab_size = len(uchars) + 1

# Encode: string → list of integers
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
Tradeoff: Vocab is tiny (~100 tokens) but sequences are long. A 1000-character text = 1000 tokens. Attention cost scales with sequence length squared.
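The snippet above can be made a self-contained round trip; a minimal sketch (using a dict for O(1) lookup instead of `uchars.index`, with an illustrative toy corpus):

```python
docs = ["emma", "hello"]                 # toy corpus

uchars = sorted(set(''.join(docs)))      # unique characters
stoi = {ch: i for i, ch in enumerate(uchars)}
BOS = len(uchars)                        # special boundary token

def encode(doc):
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    return ''.join(uchars[t] for t in tokens if t != BOS)

assert decode(encode("hello")) == "hello"   # lossless round trip
```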
BPE Algorithm
Greedily merge most frequent adjacent pairs. Statistical compression.
The algorithm:
1. Start with raw bytes (256 base tokens)
2. Count all adjacent token pairs
3. Merge the most frequent pair → new token
4. Repeat for N merges (N = vocab_size - 256)

# Core BPE loop (from no-magic/microtokenizer)
def train_bpe(ids, num_merges):
    merges = []
    for i in range(num_merges):
        counts = get_pair_counts(ids)
        pair = max(counts, key=counts.get)
        new_id = 256 + i
        ids = apply_merge(ids, pair, new_id)
        merges.append((pair, new_id))
    return merges
Key insight: BPE discovers structure without linguistic rules. Common words become single tokens; rare words split into reusable pieces. "unhappy" might become ["un", "happy"] — but this is frequency-driven, not morphological analysis.
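The two helpers the loop relies on are small; a runnable sketch with assumed signatures (`get_pair_counts` and `apply_merge` are not shown in the source excerpt):

```python
def get_pair_counts(ids):
    # Count every adjacent token pair
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def apply_merge(ids, pair, new_id):
    # Replace every occurrence of the pair with the new token id
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list(b"aaabdaaabac")            # raw bytes as base tokens
counts = get_pair_counts(ids)
pair = max(counts, key=counts.get)    # (97, 97): "aa" occurs most often
ids = apply_merge(ids, pair, 256)     # 11 tokens shrink to 9
```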
Vocab Tradeoffs
Small vocab = longer sequences. Large vocab = bigger embedding matrix.
Tokenizer    Vocab     "Hello world"
Character    ~100      11 tokens
GPT-2 BPE    50,257    2 tokens
GPT-4        100,256   2 tokens
GPT-4o       199,997   2 tokens
LLaMA 3      128,256   2 tokens
Sweet spot: 50K–200K tokens. Balances sequence length against embedding matrix size. Byte-level fallback guarantees open vocabulary — no "unknown token" failures.
Special Tokens
BOS, EOS, PAD — boundary markers for sequences.
BOS (Beginning of Sequence): Signals start of generation
EOS (End of Sequence): Model learns when to stop
PAD: Fills shorter sequences in batches (masked in loss)

# microgpt: BOS wraps each document
tokens = [BOS] + encode("emma") + [BOS]
# Model learns: BOS → start generating, BOS → stop
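microgpt trains on one document at a time, so it never needs PAD; in batched training, PAD positions are simply excluded from the loss. A toy sketch (the PAD id and loss values here are made up):

```python
PAD = 0
targets = [5, 9, 2, PAD, PAD]             # one batch row, padded to length 5
token_losses = [0.7, 1.2, 0.4, 3.0, 2.9]  # per-position cross-entropy values
mask = [t != PAD for t in targets]
loss = sum(l for l, m in zip(token_losses, mask) if m) / sum(mask)
# average over the 3 real tokens only; PAD positions contribute nothing
```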
What tokenization reveals about LLMs
LLMs never see characters — they see token IDs. "strawberry" might tokenize as ["str", "aw", "berry"]. The model has no direct access to individual letters, which explains why LLMs struggle with counting letters, spelling tasks, and character-level operations. Tokenization is linguistically blind — subword boundaries rarely align with morpheme boundaries.
Embeddings map token IDs to dense vectors — a lookup table that's mathematically equivalent to one-hot × weight matrix, but efficient. Token embeddings capture distributional similarity — tokens appearing in similar contexts get similar vectors. Position embeddings tell the model where each token is in the sequence.
Token Embedding
Just indexing into a matrix. Row N = embedding for token N.
embedding = W_emb[token_id]   ≡   one_hot(token_id) @ W_emb
# Karpathy's microgpt: embedding is a 2D list
wte = [[random.gauss(0, 0.08)
        for _ in range(n_embd)]
       for _ in range(vocab_size)]
tok_emb = wte[token_id]  # that's it — just indexing
Key insight: Each row is a learnable d-dimensional vector. Initialized randomly, then updated via backpropagation to capture co-occurrence patterns from training data.
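The equivalence between indexing and one-hot matrix multiply can be checked directly; a quick sketch in the same list-of-lists style:

```python
import random

random.seed(0)
vocab_size, n_embd = 27, 16
wte = [[random.gauss(0, 0.08) for _ in range(n_embd)]
       for _ in range(vocab_size)]

token_id = 7
one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]

# one_hot @ W_emb: each output dim sums one_hot[i] * wte[i][j]
via_matmul = [sum(one_hot[i] * wte[i][j] for i in range(vocab_size))
              for j in range(n_embd)]

assert via_matmul == wte[token_id]   # identical vector; lookup just skips the zeros
```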
Position Embedding
Tells the model WHERE in the sequence. Learned (GPT-2) or rotational (RoPE).
Method       How                                              Used By
Learned      Lookup table, same as token embedding            GPT-2
Sinusoidal   Fixed sin/cos at different frequencies           Original Transformer
RoPE         Rotate Q/K vectors by position-dependent angle   LLaMA, Qwen, Gemma
# microgpt: learned position embeddings
wpe = [[random.gauss(0, 0.08)
        for _ in range(n_embd)]
       for _ in range(block_size)]
pos_emb = wpe[pos_id]
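RoPE, by contrast, stores no table at all: it rotates consecutive (even, odd) coordinate pairs of the query and key vectors by position-dependent angles. A minimal sketch (base 10000 as in the RoPE paper; real implementations vectorize this):

```python
import math

def rope(vec, pos, base=10000.0):
    # Rotate each (even, odd) coordinate pair by an angle that grows
    # with position and shrinks with pair index (different frequencies)
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

q = rope([1.0, 2.0, 3.0, 4.0], pos=5)   # rotation preserves vector length
```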
Combined Input
Sum token + position. Model learns to disentangle.
x = token_embedding + position_embedding
# The model's actual input for each token
x = [t + p for t, p in zip(tok_emb, pos_emb)]
This addition is not arbitrary — both embeddings live in the same vector space, and the model learns during training to disentangle "what token is this" from "where is it."
Weight Tying
Same matrix for input embedding and output projection. ~30% fewer params.
In GPT-2, the input embedding matrix and the output "lm_head" are the same matrix (transposed). Tokens that are similar in meaning are close in embedding space, so they naturally get similar output probabilities.

Model         Embed Dim   Vocab    Embed Params
GPT-2 Small   768         50,257   38.6M
LLaMA 7B      4,096       32,000   131M
microgpt      16          27       688
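In code, tying means there is no separate lm_head matrix; logits are computed against the embedding rows. A sketch in microgpt's list-of-lists style (shapes are illustrative):

```python
import random

random.seed(0)
vocab_size, n_embd = 27, 16
wte = [[random.gauss(0, 0.08) for _ in range(n_embd)]
       for _ in range(vocab_size)]

def lm_head(x):
    # Tied output head: logits[t] = x · wte[t], scoring each token by
    # the similarity between the hidden state and that token's embedding
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in wte]

hidden = wte[5]             # suppose the final hidden state matches token 5's embedding
logits = lm_head(hidden)    # logits[5] = ||wte[5]||², token 5's self-similarity
```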
What embeddings encode
Embeddings don't encode "meaning" — they encode distributional similarity. "king" is near "queen" because they appear in similar contexts in the training data, not because the model understands monarchy. The famous "king - man + woman ≈ queen" works because gender is a consistent direction in the training distribution.
Attention lets each position "look at" all other positions and decide which matter most. Think of it as a soft dictionary lookup: Query asks "what am I looking for?", Key answers "what do I contain?", Value provides "what to return if matched." The result is a weighted blend of all values — not a hard lookup.
Self-Attention
softmax(QKᵀ / √dₖ) @ V — the core operation.
Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V
Step by step:
1. Q·Kᵀ — Similarity scores: every position vs every other
2. / √dₖ — Scale to prevent softmax saturation
3. softmax — Convert to probabilities (sum to 1 per row)
4. · V — Weighted blend of value vectors

# From no-magic/microattention.py
def attention(q, k, v):
    scale = 1.0 / math.sqrt(len(q[0]))   # 1/√dₖ
    scores = matmul(q, transpose(k))     # Q·Kᵀ: all-pairs similarity
    scores = [[s * scale for s in row]
              for row in scores]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)            # weighted blend of values
Why √dₖ scaling? Without it, the variance of each dot product grows linearly with the head dimension (variance ≈ dₖ for unit-variance inputs). Large scores push softmax toward one-hot outputs, whose near-zero gradients stall learning.
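The variance claim is easy to check empirically; a stdlib-only sketch:

```python
import random
import statistics

random.seed(0)
d = 256
dots = []
for _ in range(2000):
    q = [random.gauss(0, 1) for _ in range(d)]
    k = [random.gauss(0, 1) for _ in range(d)]
    dots.append(sum(a * b for a, b in zip(q, k)))

print(statistics.variance(dots))                           # ≈ d = 256
print(statistics.variance(x / d ** 0.5 for x in dots))     # ≈ 1 after 1/√d scaling
```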
Query, Key, Value
Three learned projections of the same input. Different roles.
Matrix      Role                      Analogy
Query (Q)   What am I looking for?    Search query
Key (K)     What do I contain?        Document title
Value (V)   What info do I provide?   Document content
Q = x @ W_q  # learned projection
K = x @ W_k
V = x @ W_v
The separation allows the model to learn different representations for "what to look for" vs "what to match against" vs "what to return."
Causal Mask
Future positions get −∞ before softmax → zero weight. Can't see the future.
For autoregressive generation, position t can only attend to positions 0..t (itself and earlier, never later). We set future scores to −∞ before softmax, so exp(−∞) = 0.

# Upper triangle = future positions
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores.masked_fill_(mask, float('-inf'))
weights = torch.softmax(scores, dim=-1)  # -inf → 0

# Result: triangular attention pattern
# pos 0: sees [self]
# pos 1: sees [pos 0, self]
# pos n: sees [pos 0, ..., self]
Multi-Head Attention
Multiple attention patterns in parallel. Same FLOPs, richer representation.
Instead of one d-dimensional attention, use h heads each with d/h dimensions. Each head can specialize — one learns syntax, another semantics, another position. Same total FLOPs as single-head.

# Karpathy's microgpt: multi-head via slicing
for h in range(n_head):
    hs = h * head_dim
    q_h = q[hs:hs+head_dim]
    k_h = [ki[hs:hs+head_dim] for ki in keys]
    v_h = [vi[hs:hs+head_dim] for vi in values]
    # attention per head, concatenate all heads
The W_O projection applied after concatenation is where cross-head mixing happens, combining signals from all the attention patterns.
Variant   KV Heads    Memory    Used By
MHA       = Q heads   Highest   GPT-2
GQA       Q/group     Middle    LLaMA 2/3
MQA       1           Lowest    Falcon
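The three variants differ only in how many K/V projections exist, and hence how much KV cache they store; a toy sketch of the query-head → KV-head mapping (head counts are illustrative):

```python
def kv_head_for(q_head, n_q_heads, n_kv_heads):
    # Query heads split into n_kv_heads groups; each group shares one KV head
    return q_head // (n_q_heads // n_kv_heads)

assert [kv_head_for(h, 8, 8) for h in range(8)] == [0, 1, 2, 3, 4, 5, 6, 7]  # MHA
assert [kv_head_for(h, 8, 2) for h in range(8)] == [0, 0, 0, 0, 1, 1, 1, 1]  # GQA
assert [kv_head_for(h, 8, 1) for h in range(8)] == [0] * 8                   # MQA
```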
The Feed-Forward Network processes each position independently: expand → activate → contract. Often called the "memory" of the transformer — factual knowledge is stored in these weights. Residual connections (x + f(x)) create a gradient highway that makes deep networks trainable.
FFN Structure
Two linear layers with 4× expansion and activation in between.
FFN(x) = W₂ · activation( W₁ · x )
# Karpathy's microgpt: MLP block
x = linear(x, mlp_fc1)        # [d] → [4d] expand
x = [xi.relu() for xi in x]   # activate
x = linear(x, mlp_fc2)        # [4d] → [d] contract
Why 4× expansion? More capacity to compute complex functions without widening the residual stream. The model temporarily expands to a higher-dimensional space, processes, then compresses back.
Activations
ReLU → GELU → SwiGLU: evolution of nonlinearity.
Activation   Formula                  Used By
ReLU         max(0, x)                Educational, older models
GELU         x · Φ(x) ≈ smooth ReLU   GPT-2, GPT-3, BERT
SiLU/Swish   x · sigmoid(x)           Newer models
SwiGLU       (xW₁ ⊙ Swish(xV))W₂      LLaMA 2/3, PaLM
All serve the same purpose: introduce nonlinearity so the network can approximate arbitrary functions. Without activation, stacking linear layers collapses to a single linear transformation.
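The pointwise variants take only a few lines each (exact GELU via the error function; SwiGLU is left out because it gates one linear projection with the SiLU of another rather than acting elementwise):

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):                        # x · Φ(x), exact form via erf
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):                        # x · sigmoid(x)
    return x / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 2.0):
    print(f"{x:+.1f}  relu={relu(x):.3f}  gelu={gelu(x):.3f}  silu={silu(x):.3f}")
```

Note how GELU and SiLU pass small negative values through slightly instead of clipping them to zero, which is what "smooth ReLU" means in the table above.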
Residual Connections
x + f(x) — gradient highway. Makes deep networks trainable.
x = x + attention(norm(x))
x = x + ffn(norm(x))
Without residuals, gradients must pass through every layer's transformations — they vanish in deep networks. The residual "highway" lets gradients flow directly backward unimpeded.
# Both attention and FFN are wrapped in residuals
x_residual = x
x = rmsnorm(x)
x = attention(x)
x = [a + b for a, b in zip(x, x_residual)]  # add skip
Pre-Norm vs Post-Norm
Normalize BEFORE transformation (modern). Cleaner residual path.
Post-Norm (2017 Transformer): x = Norm(x + f(x))
Pre-Norm (GPT-2+): x = x + f(Norm(x))

Pre-norm keeps the residual path clean — no normalization layer on the skip connection. More stable training. All modern LLMs use pre-norm, typically RMSNorm instead of LayerNorm.

# RMSNorm (simpler than LayerNorm — no mean subtraction)
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
Attention vs FFN: Division of labor
Attention routes information between tokens ("who talks to whom"). FFN processes information within each token ("what to say"). Interpretability research shows factual knowledge ("Paris is the capital of France") is stored in FFN weights, while attention handles the contextual routing. Together, one transformer block combines context-dependent retrieval with position-independent computation.
Training is simple: predict next token → compute loss → backpropagate → update weights. Cross-entropy loss measures "surprise" at the correct answer. Adam combines momentum with adaptive learning rates. This loop, repeated billions of times with different data, compresses the statistical patterns of language into model weights.
Softmax
Converts logits to probabilities that sum to 1.
softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ)
# Numerically stable softmax (subtract max first)
def softmax(logits):
    max_val = max(v.data for v in logits)
    exps = [(v - max_val).exp() for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
Subtract max for stability: Without this, large logits cause exp() to overflow to infinity. The subtraction doesn't change the output probabilities.
Cross-Entropy Loss
−log(p[correct]). High penalty for confident wrong answers.
loss = −log( P(correct token) )
If the model assigns the correct token probability 0.9 → loss = 0.105 (mild)
If it assigns probability 0.01 → loss = 4.6 (severe)

The log makes the penalty grow increasingly severe as confidence in the wrong answer increases. This is the only training signal — predict the next token better.
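Those numbers are just −ln p; checking a few values:

```python
import math

for p in (0.9, 0.5, 0.1, 0.01):
    print(f"P(correct) = {p:<5} → loss = {-math.log(p):.3f}")
# 0.105, 0.693, 2.303, 4.605
```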
# The loss for one token prediction
logits = gpt(token_id, pos_id, keys, values)
probs = softmax(logits)
loss = -probs[target_id].log()
Backpropagation
Chain rule from loss to every parameter. Each param gets a gradient.
Starting from the loss, work backward through every operation: for each parameter, compute "if I nudge this value, how much does the loss change?"

# Karpathy's autograd backward pass
def backward(self):
    topo, visited = [], set()
    def build_topo(v):              # topological sort of computation graph
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    self.grad = 1  # seed: dloss/dloss = 1
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad  # chain rule
6 primitive operations with known derivatives — add, mul, pow, exp, log, relu — compose to compute gradients through any neural network.
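Those primitives compose into a complete autograd engine; a minimal sketch in the spirit of Karpathy's micrograd (illustrative: only add and mul, scalar values):

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data, self.grad = data, 0.0
        self._children, self._local_grads = children, local_grads

    def __add__(self, other):       # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):       # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0             # seed: dloss/dloss = 1
        for v in reversed(topo):
            for child, g in zip(v._children, v._local_grads):
                child.grad += g * v.grad   # chain rule

a, b = Value(2.0), Value(3.0)
loss = a * b + a                    # dloss/da = b + 1 = 4, dloss/db = a = 2
loss.backward()
```

Note that `a.grad` accumulates from two paths (through the mul and through the add), which is why the chain-rule line uses `+=`.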
Adam Optimizer
Momentum + adaptive rates. Bias correction for early steps.
m = β₁·m + (1−β₁)·∇L          (momentum: running mean of gradients)
v = β₂·v + (1−β₂)·(∇L)²       (running mean of squared gradients)
θ = θ − lr · m̂ / (√v̂ + ε)     (update with bias-corrected m̂, v̂)
# Complete Adam from Karpathy's microgpt
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
    m_hat = m[i] / (1 - beta1 ** (step + 1))  # bias correction
    v_hat = v[i] / (1 - beta2 ** (step + 1))
    p.data -= lr * m_hat / (v_hat ** 0.5 + eps)
Momentum averages past gradients → consistent direction. Adaptive scaling divides by gradient variance → big gradients get small steps, small gradients get big steps. Bias correction fixes the cold-start problem when m and v are initialized at 0.
The Training Loop
Forward → loss → backward → update. Repeat billions of times.
# The ENTIRE learning algorithm
for step in range(num_steps):
    # 1. Forward: compute predictions
    logits = model(input_tokens)
    # 2. Loss: measure wrongness
    loss = cross_entropy(logits, target_tokens)
    # 3. Backward: compute gradients
    loss.backward()
    # 4. Update: nudge weights toward less loss
    optimizer.step()
    optimizer.zero_grad()
This 4-line loop, repeated with different data, is how every LLM learns. GPT-4 and microgpt use the same algorithm — only scale differs.
Learning Rate Schedule
Warmup then decay. Start cautious, push hard, then refine.
Warmup: Start small because early gradients are noisy. Ramp up over first ~1% of training.
Cosine decay: Gradually reduce to near zero. As you approach the minimum, large steps overshoot.

# Karpathy's microgpt uses simple linear decay
lr_t = learning_rate * (1 - step / num_steps)

# Production models use warmup + cosine decay
if step < warmup_steps:
    lr = peak_lr * step / warmup_steps
else:
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr = peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
What training reveals
The training objective is next-token prediction — a purely statistical task. By predicting what comes next in trillions of tokens, the model is forced to learn grammar (to predict syntax), facts (to predict entities), and patterns that resemble reasoning (to predict logical conclusions). All of it compressed into weight matrices through gradient descent.
Educational sources and inspiration:
microgpt.py by Andrej Karpathy — "This file is the complete algorithm. Everything else is just efficiency."
no-magic by Mathews-Tom — 41 algorithms from scratch, zero dependencies
LLMs-from-scratch by Sebastian Raschka — Build a Large Language Model (Manning, 2024)
Stanford CS336 — Language Modeling from Scratch (Hashimoto & Liang)
Stanford CME 295 — Transformers & Large Language Models (Amidi & Amidi) | cheatsheet