Tokenization converts raw text into sequences of integers that neural networks can
process. The challenge is finding the right
granularity: character-level is simple but produces long sequences;
BPE (Byte Pair Encoding) compresses common patterns into single tokens.
Every LLM starts here
— the model never sees raw text, only token IDs.
Character-Level
One token per character. Simple but ~4× longer sequences.
How it works: Map each unique character to an integer. "hello" becomes [7, 4,
11, 11, 14] — five tokens for five characters.
# From Karpathy's microgpt — 5 lines of tokenizer
uchars = sorted(set(''.join(docs))) # unique chars
BOS = len(uchars) # special token
vocab_size = len(uchars) + 1
# Encode: string → list of integers
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
Tradeoff: Vocab is tiny (~100 tokens) but sequences are long. A 1000-character
text = 1000 tokens. Attention cost scales with sequence length squared.
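The snippet above can be rounded out into a runnable encode/decode pair. A minimal sketch, assuming a toy `docs` corpus; the `stoi` dict and `decode` helper are additions for illustration, not part of the original:

```python
docs = ["hello", "world"]                      # hypothetical tiny corpus
uchars = sorted(set(''.join(docs)))            # unique chars, sorted for determinism
BOS = len(uchars)                              # one extra id as the boundary token
stoi = {ch: i for i, ch in enumerate(uchars)}  # char → id lookup

def encode(doc):
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    return ''.join(uchars[t] for t in tokens if t != BOS)

assert decode(encode("hello")) == "hello"      # round-trip is lossless
```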
BPE Algorithm
Greedily merge most frequent adjacent pairs. Statistical compression.
The algorithm:
1. Start with raw bytes (256 base tokens)
2. Count all adjacent token pairs
3. Merge the most frequent pair → new token
4. Repeat for N merges (N = vocab_size - 256)
# Core BPE loop (from no-magic/microtokenizer)
def train_bpe(ids, num_merges):
    merges = []
    for i in range(num_merges):
        counts = get_pair_counts(ids)
        pair = max(counts, key=counts.get)
        new_id = 256 + i
        ids = apply_merge(ids, pair, new_id)
        merges.append((pair, new_id))
    return merges
Key insight: BPE discovers structure without linguistic rules. Common words
become single tokens; rare words split into reusable pieces. "unhappy" might become
["un", "happy"] — but this is frequency-driven, not morphological analysis.
Vocab Tradeoffs
Small vocab = longer sequences. Large vocab = bigger embedding matrix.
| Tokenizer | Vocab | "Hello world" |
|---|---|---|
| Character | ~100 | 11 tokens |
| GPT-2 BPE | 50,257 | 2 tokens |
| GPT-4 | 100,256 | 2 tokens |
| GPT-4o | 199,997 | 2 tokens |
| LLaMA 3 | 128,256 | 2 tokens |
Sweet spot: 50K–200K tokens. Balances sequence length against embedding matrix
size. Byte-level fallback guarantees open vocabulary — no "unknown token" failures.
Special Tokens
BOS, EOS, PAD — boundary markers for sequences.
BOS (Beginning of Sequence): Signals start of generation
EOS (End of Sequence): Model learns when to stop
PAD: Fills shorter sequences in batches (masked in loss)
# microgpt: BOS wraps each document
tokens = [BOS] + encode("emma") + [BOS]
# Model learns: BOS → start generating, BOS → stop
What tokenization reveals about LLMs
LLMs never see characters — they see token IDs. "strawberry" might tokenize as ["str",
"aw", "berry"]. The model has no direct access to individual letters, which
explains why LLMs struggle with counting letters, spelling tasks, and character-level
operations. Tokenization is linguistically blind — subword boundaries rarely align with
morpheme boundaries.
Embeddings map token IDs to dense vectors — a
lookup table that's mathematically equivalent to one-hot × weight matrix, but
efficient. Token embeddings capture
distributional similarity — tokens appearing in similar
contexts get similar vectors. Position embeddings tell the model where each token
is in the sequence.
Token Embedding
Just indexing into a matrix. Row N = embedding for token N.
embedding = W_emb[token_id] ≡ one_hot(token_id) @ W_emb
# Karpathy's microgpt: embedding is a 2D list
wte = [[random.gauss(0, 0.08) for _ in range(n_embd)]
       for _ in range(vocab_size)]
tok_emb = wte[token_id]  # that's it — just indexing
Key insight: Each row is a learnable d-dimensional vector. Initialized
randomly, then updated via backpropagation to capture co-occurrence patterns from
training data.
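The one-hot equivalence stated above is easy to verify directly. A toy sketch with made-up dimensions (`vec_mat` is an illustrative helper, not from the original):

```python
import random

vocab_size, n_embd = 5, 4  # toy dimensions
wte = [[random.gauss(0, 0.08) for _ in range(n_embd)]
       for _ in range(vocab_size)]

def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]

def vec_mat(v, m):
    # (1×V) row vector times (V×d) matrix → (1×d)
    return [sum(v[i] * m[i][j] for i in range(len(v)))
            for j in range(len(m[0]))]

# Multiplying by a one-hot row selects exactly one row of the matrix:
assert vec_mat(one_hot(3, vocab_size), wte) == wte[3]
```

Indexing is the same operation without the V−1 wasted multiplications, which is why embedding layers are implemented as lookups.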
Position Embedding
Tells the model WHERE in the sequence. Learned (GPT-2) or rotational (RoPE).
| Method | How | Used By |
|---|---|---|
| Learned | Lookup table, same as token embedding | GPT-2 |
| Sinusoidal | Fixed sin/cos at different frequencies | Original Transformer |
| RoPE | Rotate Q/K vectors by position-dependent angle | LLaMA, Qwen, Gemma |
# microgpt: learned position embeddings
wpe = [[random.gauss(0, 0.08) for _ in range(n_embd)]
       for _ in range(block_size)]
pos_emb = wpe[pos_id]
Combined Input
Sum token + position. Model learns to disentangle.
x = token_embedding + position_embedding
# The model's actual input for each token
x = [t + p for t, p in zip(tok_emb, pos_emb)]
This addition is not arbitrary — both embeddings live in the same vector space, and
the model learns during training to disentangle "what token is this" from "where is
it."
Weight Tying
Same matrix for input embedding and output projection. ~30% fewer params.
In GPT-2, the input embedding matrix and the output "lm_head" are
the same matrix (transposed). Tokens that are similar in meaning are close in
embedding space, so they naturally get similar output probabilities.
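The tied head can be sketched directly: logits are dot products of the hidden state against every row of the same `wte` used for input lookup (toy numbers, pure Python):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lm_head(x, wte):
    # Tied output projection: logits[i] = x · wte[i], i.e. x @ wte.T
    return [dot(x, row) for row in wte]

wte = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3-token vocab, d = 2
x = [1.0, 0.0]                              # some hidden state
logits = lm_head(x, wte)                    # one logit per vocab token
assert logits == [0.1, 0.3, 0.5]
```

No second matrix is ever allocated, which is where the parameter saving comes from.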
| Model | Embed Dim | Vocab | Embed Params |
|---|---|---|---|
| GPT-2 Small | 768 | 50,257 | 38.6M |
| LLaMA 7B | 4,096 | 32,000 | 131M |
| microgpt | 16 | 27 | 688 |
What embeddings encode
Embeddings don't encode "meaning" — they encode
distributional similarity. "king" is near "queen" because they appear in similar
contexts in the training data, not because the model understands monarchy. The famous
"king - man + woman ≈ queen" works because gender is a consistent direction in the
training distribution.
Attention lets each position "look at" all other positions and decide which matter most.
Think of it as a
soft dictionary lookup: Query asks "what am I looking for?",
Key answers "what do I contain?", Value provides "what to return if matched." The result
is a weighted blend of all values — not a hard lookup.
Self-Attention
softmax(QK^T / √d) @ V — the core operation.
Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V
Step by step:
1. Q·Kᵀ — Similarity scores: every position vs every other
2. / √dₖ — Scale to prevent softmax saturation
3. softmax — Convert to probabilities (sum to 1 per row)
4. · V — Weighted blend of value vectors
# From no-magic/microattention.py
def attention(q, k, v):
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    scores = [[s * scale for s in row] for row in scores]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)
Why √dₖ scaling? Without it, the variance of the dot products grows linearly with dimension (variance ≈ dₖ). Large values push softmax toward one-hot outputs, killing gradients.
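The growth is easy to measure. A sketch that estimates the variance of q·k for random unit-variance vectors (the estimate is noisy, but comes out near d):

```python
import random

def dot_variance(d, trials=2000):
    # Empirical variance of q·k where q, k have i.i.d. N(0, 1) entries
    samples = []
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d)]
        k = [random.gauss(0, 1) for _ in range(d)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

# Raw dot products: variance ≈ d. After dividing scores by √d: variance ≈ 1.
```

With d = 64 the measured variance lands near 64, so unscaled logits would routinely reach magnitudes where softmax saturates.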
Query, Key, Value
Three learned projections of the same input. Different roles.
| Matrix | Role | Analogy |
|---|---|---|
| Query (Q) | What am I looking for? | Search query |
| Key (K) | What do I contain? | Document title |
| Value (V) | What info do I provide? | Document content |
Q = x @ W_q # learned projection
K = x @ W_k
V = x @ W_v
The separation allows the model to learn
different representations for "what to
look for" vs "what to match against" vs "what to return."
Causal Mask
Future positions get −∞ before softmax → zero weight. Can't see the future.
For autoregressive generation, position t can only attend to positions 0..t (itself and earlier). We set future scores to −∞ before softmax, so exp(−∞) = 0.
# Upper triangle = future positions
mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
scores.masked_fill_(mask, float('-inf'))
weights = torch.softmax(scores, dim=-1)  # exp(-inf) → 0
# Result: triangular attention pattern
# pos 0: sees [self]
# pos 1: sees [pos 0, self]
# pos n: sees [pos 0, ..., self]
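The same mask works without a framework. A dependency-free sketch in the style of the document's other snippets, fusing the mask into a per-row softmax:

```python
import math

def causal_attention_weights(scores):
    # scores: n×n raw attention scores; mask the upper triangle (the future)
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float('-inf') for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) = 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

w = causal_attention_weights([[0.0, 9.0], [1.0, 1.0]])
assert w[0] == [1.0, 0.0]   # position 0 cannot see position 1
assert w[1] == [0.5, 0.5]   # position 1 attends to both equally
```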
Multi-Head Attention
Multiple attention patterns in parallel. Same FLOPs, richer representation.
Instead of one d-dimensional attention, use h heads each with d/h dimensions. Each
head can specialize — one learns syntax, another semantics, another position.
Same total FLOPs as single-head.
# Karpathy's microgpt: multi-head via slicing
for h in range(n_head):
    hs = h * head_dim
    q_h = q[hs:hs+head_dim]
    k_h = [ki[hs:hs+head_dim] for ki in keys]
    v_h = [vi[hs:hs+head_dim] for vi in values]
    # attention per head, concatenate all heads
The W_O projection after concatenation is where cross-head mixing happens — combining signals from all attention patterns.
| Variant | KV Heads | Memory | Used By |
|---|---|---|---|
| MHA | = Q heads | Highest | GPT-2 |
| GQA | 1 per Q group | Middle | LLaMA 2/3 |
| MQA | 1 | Lowest | Falcon |
The Feed-Forward Network processes each position
independently: expand → activate → contract. Often called the
"memory" of the transformer — factual knowledge is stored in
these weights. Residual connections (x + f(x)) create a gradient highway that makes
deep networks trainable.
FFN Structure
Two linear layers with 4× expansion and activation in between.
FFN(x) = W₂ · activation(W₁ · x)
# Karpathy's microgpt: MLP block
x = linear(x, mlp_fc1) # [d] → [4d] expand
x = [xi.relu() for xi in x] # activate
x = linear(x, mlp_fc2) # [4d] → [d] contract
Why 4× expansion? More capacity to compute complex functions without widening
the residual stream. The model temporarily expands to a higher-dimensional space,
processes, then compresses back.
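The expand → activate → contract shape can be run end to end with toy weights (a hypothetical 1-dimensional input with four hidden units, preserving the 4× ratio):

```python
def linear(x, w):
    # y[j] = Σᵢ w[j][i] · x[i]; len(w) output features
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def ffn(x, w1, w2):
    h = linear(x, w1)                # [d] → [4d] expand
    h = [max(0.0, hi) for hi in h]   # ReLU activate
    return linear(h, w2)             # [4d] → [d] contract

w1 = [[1.0], [-1.0], [2.0], [0.5]]  # d=1 → 4 hidden units
w2 = [[1.0, 1.0, 1.0, 1.0]]         # 4 hidden units → d=1
assert ffn([2.0], w1, w2) == [7.0]  # relu([2, -2, 4, 1]) sums to 7
```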
Activations
ReLU → GELU → SwiGLU: evolution of nonlinearity.
| Activation | Formula | Used By |
|---|---|---|
| ReLU | max(0, x) | Educational, older models |
| GELU | x · Φ(x) ≈ smooth ReLU | GPT-2, GPT-3, BERT |
| SiLU/Swish | x · sigmoid(x) | Newer models |
| SwiGLU | (xW₁ ⊙ Swish(xV))W₂ | LLaMA 2/3, PaLM |
All serve the same purpose: introduce
nonlinearity so the network can
approximate arbitrary functions. Without activation, stacking linear layers collapses
to a single linear transformation.
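The first three rows of the table are one-liners. Sketches using the exact GELU via the Gaussian CDF (SwiGLU is omitted because it needs the extra gating weight matrices):

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x · Φ(x), with the Gaussian CDF written via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # SiLU / Swish: x · sigmoid(x)
    return x / (1.0 + math.exp(-x))

# All three pass large positive x through nearly unchanged and squash
# large negative x toward 0; they differ in smoothness around 0.
```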
Residual Connections
x + f(x) — gradient highway. Makes deep networks trainable.
x = x + attention(norm(x))
x = x + ffn(norm(x))
Without residuals, gradients must pass through every layer's transformations — they
vanish in deep networks. The residual "highway" lets gradients flow
directly backward unimpeded.
# Both attention and FFN are wrapped in residuals
x_residual = x
x = rmsnorm(x)
x = attention(x)
x = [a + b for a, b in zip(x, x_residual)] # add skip
Pre-Norm vs Post-Norm
Normalize BEFORE transformation (modern). Cleaner residual path.
Post-Norm (2017 Transformer): x = Norm(x + f(x))
Pre-Norm (GPT-2+): x = x + f(Norm(x))
Pre-norm keeps the residual path clean — no normalization layer on the skip
connection. More stable training. All modern LLMs use pre-norm, typically
RMSNorm instead of LayerNorm.
# RMSNorm (simpler than LayerNorm — no mean subtraction)
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
Attention vs FFN: Division of labor
Attention routes information between tokens ("who talks to whom").
FFN processes information within each token ("what to say"). Interpretability
research shows factual knowledge ("Paris is the capital of France") is stored in FFN
weights, while attention handles the contextual routing. Together, one transformer block
combines context-dependent retrieval with position-independent computation.
Training is simple:
predict next token → compute loss → backpropagate → update weights. Cross-entropy
loss measures "surprise" at the correct answer. Adam combines momentum with adaptive
learning rates. This loop, repeated billions of times with different data, compresses the
statistical patterns of language into model weights.
Softmax
Converts logits to probabilities that sum to 1.
softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ)
# Numerically stable softmax (subtract max first)
def softmax(logits):
    max_val = max(v.data for v in logits)
    exps = [(v - max_val).exp() for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
Subtract max for stability: Without this, large logits cause exp() to overflow
to infinity. The subtraction doesn't change the output probabilities.
Cross-Entropy Loss
−log(p[correct]). High penalty for confident wrong answers.
loss = −log( P(correct token) )
If probability is 0.9 → loss = 0.105 (small, correct)
If probability is 0.01 → loss = 4.6 (large, wrong)
The log makes the penalty grow
increasingly severe as confidence in the wrong answer increases. This is the
only training signal — predict the next token better.
# The loss for one token prediction
logits = gpt(token_id, pos_id, keys, values)
probs = softmax(logits)
loss = -probs[target_id].log()
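The two loss values quoted above check out numerically:

```python
import math

def cross_entropy(p_correct):
    # Loss for a single prediction: −log of probability on the right answer
    return -math.log(p_correct)

assert round(cross_entropy(0.9), 3) == 0.105   # confident and right: small
assert round(cross_entropy(0.01), 1) == 4.6    # confident and wrong: large
```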
Backpropagation
Chain rule from loss to every parameter. Each param gets a gradient.
Starting from the loss, work backward through every operation: for each parameter,
compute "if I nudge this value, how much does the loss change?"
# Karpathy's autograd backward pass
def backward(self):
    topo = []  # topological sort of computation graph
    build_topo(self)
    self.grad = 1  # seed: dloss/dloss = 1
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad  # chain rule
Six primitive operations with known derivatives — add, mul, pow, exp, log, relu — compose to compute gradients through any neural network.
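The whole mechanism fits in a small class. A simplified sketch with only add and mul (the full version carries all six primitives):

```python
class Value:
    """Minimal scalar autograd node: tracks children and local gradients."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a·b)/da = b, d(a·b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        topo, seen = [], set()
        def build(v):
            if id(v) not in seen:
                seen.add(id(v))
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0  # seed: dloss/dloss = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad  # chain rule

a, b = Value(2.0), Value(3.0)
loss = a * b + a          # loss = ab + a = 8
loss.backward()
assert (a.grad, b.grad) == (4.0, 2.0)  # dloss/da = b + 1, dloss/db = a
```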
Adam Optimizer
Momentum + adaptive rates. Bias correction for early steps.
m = β₁·m + (1−β₁)·∇L      (momentum)
v = β₂·v + (1−β₂)·(∇L)²   (variance)
θ = θ − lr · m̂ / (√v̂ + ε)  (update)
# Complete Adam from Karpathy's microgpt
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
    m_hat = m[i] / (1 - beta1 ** (step + 1))  # bias correction
    v_hat = v[i] / (1 - beta2 ** (step + 1))
    p.data -= lr * m_hat / (v_hat ** 0.5 + eps)
Momentum averages past gradients → consistent direction.
Adaptive scaling divides by the running RMS of each parameter's gradients → parameters with large gradients take small steps, those with small gradients take big steps.
Bias correction fixes the cold-start problem when m and v are initialized at 0.
The Training Loop
Forward → loss → backward → update. Repeat billions of times.
# The ENTIRE learning algorithm
for step in range(num_steps):
    # 1. Forward: compute predictions
    logits = model(input_tokens)
    # 2. Loss: measure wrongness
    loss = cross_entropy(logits, target_tokens)
    # 3. Backward: compute gradients
    loss.backward()
    # 4. Update: nudge weights toward less loss
    optimizer.step()
    optimizer.zero_grad()
This 4-line loop, repeated with different data, is how
every LLM learns. GPT-4 and microgpt use the same algorithm — only scale
differs.
Learning Rate Schedule
Warmup then decay. Start cautious, push hard, then refine.
Warmup: Start small because early gradients are noisy. Ramp up over first ~1%
of training.
Cosine decay: Gradually reduce to near zero. As you approach the minimum, large
steps overshoot.
# Karpathy's microgpt uses simple linear decay
lr_t = learning_rate * (1 - step / num_steps)

# Production models use warmup + cosine decay
if step < warmup_steps:
    lr = peak_lr * step / warmup_steps
else:
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr = peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
What training reveals
The training objective is next-token prediction — a purely statistical task. By
predicting what comes next in trillions of tokens, the model is forced to learn grammar
(to predict syntax), facts (to predict entities), and patterns that resemble reasoning (to
predict logical conclusions). All of it compressed into weight matrices through gradient
descent.
Educational sources and inspiration:
- microgpt.py by Andrej Karpathy — "This file is the complete algorithm. Everything else is just efficiency."
- no-magic by Mathews-Tom — 41 algorithms from scratch, zero dependencies
- LLMs-from-scratch by Sebastian Raschka — Build a Large Language Model (Manning, 2024)
- Stanford CS336 — Language Modeling from Scratch (Hashimoto & Liang)
- Stanford CME 295 — Transformers & Large Language Models cheatsheet (Amidi & Amidi)