
LLM Training

Phase 1: Pre-training Next-token prediction on trillions of tokens. The foundation: world knowledge, language structure, latent capabilities. Dominated by compute, data, and optimization.
Phase 2: Mid-training Annealing with high-quality data, domain adaptation, long-context extension. Bridges raw pre-training and behavioral fine-tuning.
Phase 3: Post-training SFT, RLHF, DPO, GRPO, RLVR. Alignment, instruction-following, reasoning. Surface compliance or genuine capability shift?
Phase 1 · Pre-training
Training Loop
Forward pass, cross-entropy loss, backward pass, optimizer step. Repeated billions of times across trillions of tokens.
L = -1/T ∑ log p_θ(x_t | x_<t)
The entire capability of the model derives from this single objective: predict the next token. Every representation, every "fact", every apparent reasoning ability is a byproduct of compression under this loss.

Batch size: starts small (~0.5M tokens), ramps to 4-60M tokens. Gradient accumulation simulates large batches on limited hardware.
Training duration: GPT-3 = 300B tokens. Llama 3 = 15T tokens. Chinchilla-optimal for 70B ≈ 1.4T tokens, but modern models overtrain by 5-10× for inference efficiency.
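The objective above fits in a few lines; a minimal pure-Python sketch of per-token cross-entropy (a toy illustration, not a training loop):

```python
import math

def next_token_loss(logits, targets):
    """Mean cross-entropy over a sequence.

    logits: one logit vector per position (one entry per vocab token);
    targets: index of the true next token at each position.
    """
    total = 0.0
    for step_logits, target in zip(logits, targets):
        # log-softmax, stabilized by subtracting the max logit
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in step_logits))
        total += log_z - step_logits[target]  # -log p(target)
    return total / len(targets)

# A confident correct prediction costs almost nothing; a uniform
# distribution over a 4-token vocabulary costs log(4) ≈ 1.386.
confident = next_token_loss([[10.0, 0.0, 0.0, 0.0]], [0])
uniform = next_token_loss([[0.0, 0.0, 0.0, 0.0]], [0])
```

Subtracting the max logit before exponentiating is the standard trick that keeps the softmax numerically stable.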
Thesis
Every capability the model will ever exhibit traces back to this compression objective. Post-training can steer but not fundamentally extend what pre-training encoded.
Optimization
AdamW & Learning Rate
AdamW with warmup + cosine/WSD decay. The optimizer is the engine; the schedule is the fuel curve.
Evolution: SGD → SGD+Momentum → Adam → AdamW (decoupled weight decay) → Muon/SOAP (2025)

AdamW: θ_{t+1} = θ_t − η(m̂_t / (√v̂_t + ε) + λθ_t)
m_t = β_1 m_{t-1} + (1−β_1)g_t, v_t = β_2 v_{t-1} + (1−β_2)g_t²
Learning rate schedule:
Warmup: 0 → peak over ~2000 steps (prevents early instability)
Cosine decay: peak → ~0.1× peak over training
WSD (Warmup-Stable-Decay): warmup → constant → sharp decay. Used by Llama 3, enables mid-training branches.

Muon (Kimi K2, 2025): momentum-based orthogonalized update. Claimed 2× compute efficiency vs AdamW for MoE models. Uses Newton-Schulz orthogonalization on momentum matrix.
SOAP: Shampoo-style second-order approximation. More memory but better conditioning on large models.
⚠ Critical
Peak LR is the most sensitive hyperparameter. Too high = loss spikes and instability. Too low = undertrained model. Typically found by small-scale sweep + scaling law extrapolation.
Precision & Hardware
Mixed Precision Training
bf16/fp16 forward + backward, fp32 master weights and optimizer states. fp8 emerging for next-gen hardware.
Format | Bits | Range | Use
fp32 | 32 | ±3.4×10^38 | Master weights, optimizer states
bf16 | 16 | same as fp32 | Forward/backward pass (standard)
fp16 | 16 | ±65504 | Older GPUs, needs loss scaling
fp8 (E4M3) | 8 | ±448 | H100/B100 matmuls, ~2× throughput

Why bf16 won: same exponent range as fp32 (8 bits), so no loss scaling needed. fp16 has only 5 exponent bits — requires careful scaling to avoid overflow/underflow.

DeepSeek V3 fp8: full fp8 training with fine-grained quantization (per-tile scaling). First frontier model trained entirely in fp8. 2.788M H800 GPU-hours at $5.58M total.
Thesis
If models can be trained in 8-bit precision with no quality loss, the representations may be less precise than assumed — more consistent with approximate pattern matching than exact symbolic reasoning.
Distributed
Parallelism & Scale
Data, tensor, pipeline, and expert parallelism. ZeRO sharding for memory efficiency. 1000s of GPUs in lockstep.
Data Parallel (DDP): each GPU holds full model, splits data. All-reduce gradients. Simple but limited by model size.

ZeRO (1/2/3): shard optimizer states (Z1), gradients (Z2), or parameters (Z3) across GPUs. Llama 3: FSDP (Z3 equivalent).
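The sharding stages translate into a simple memory estimate. A back-of-envelope sketch, assuming the ZeRO paper's 16-bytes-per-parameter accounting for mixed-precision Adam (bf16 weights and grads, fp32 master weights and moments; activations and buffers excluded):

```python
def bytes_per_param(stage, dp_degree):
    """Approximate training-memory footprint per parameter under ZeRO.

    Accounting (ZeRO paper): bf16 weights (2) + bf16 grads (2) +
    fp32 master weights, m, v (12) = 16 bytes per parameter.
    """
    weights, grads, optim = 2, 2, 12  # bytes
    if stage == 0:            # plain DDP: everything replicated
        return weights + grads + optim
    if stage == 1:            # ZeRO-1: shard optimizer states
        return weights + grads + optim / dp_degree
    if stage == 2:            # ZeRO-2: also shard gradients
        return weights + (grads + optim) / dp_degree
    if stage == 3:            # ZeRO-3 / FSDP: shard everything
        return (weights + grads + optim) / dp_degree
    raise ValueError(stage)

def model_memory_gb(n_params, stage, dp_degree):
    return n_params * bytes_per_param(stage, dp_degree) / 1e9

# A 7B model: ~112 GB replicated (impossible on one 80 GB GPU),
# ~14 GB per GPU under ZeRO-3 across 8 GPUs.
```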

Tensor Parallel (TP): split individual layers across GPUs. Column-parallel for first linear, row-parallel for second. Requires fast interconnect (NVLink).

Pipeline Parallel (PP): split layers across GPU groups. Micro-batching fills the pipeline bubble. GPipe, 1F1B schedules.

Expert Parallel (EP): distribute MoE experts across GPUs. All-to-all communication for token routing.

Llama 3 405B: 4D parallelism (TP=8 + PP=16 + CP=4 + DP) on 16,384 H100s. 95% uptime with automatic failure recovery.
⚠ Critical
Communication overhead scales with parallelism degree. At 10,000+ GPUs, collective communication accounts for 30-40% of wall-clock time. Hardware efficiency (MFU) rarely exceeds 50%.
Compute Budget
Scaling Laws
Loss = f(N, D, C). Chinchilla: train N and D proportionally. Modern practice: overtrain smaller models for inference savings.
L(N, D) = A/N^α + B/D^β + L_∞
Kaplan (2020): α ≈ 0.076, β ≈ 0.095
Chinchilla (2022): optimal D ≈ 20×N (given fixed compute)
Chinchilla (2022): for compute-optimal training, data and parameters should scale proportionally. Gopher (280B, 300B tokens) was undertrained; Chinchilla (70B, 1.4T tokens) outperformed it at roughly the same compute budget with a model 4× smaller.

Modern overtraining: Llama 3 8B trained on 15T tokens (~1,875 tokens per parameter, roughly 94× the Chinchilla-optimal 20:1 ratio). Inference cost dominates: cheaper to overtrain once than serve an underperforming model billions of times.

Compute cost: C ≈ 6ND FLOPs for a full training run (≈6N FLOPs per token, forward + backward). GPT-4 estimated at ~10^25 FLOPs. Llama 3 405B = 3.8×10^25 FLOPs.
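The budget arithmetic is easy to script; a sketch using the standard C ≈ 6ND approximation and the Chinchilla 20:1 rule:

```python
import math

def train_flops(n_params, n_tokens):
    # Standard approximation: ~6 FLOPs per parameter per training token
    return 6 * n_params * n_tokens

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Split a budget C = 6*N*D under the constraint D = 20*N."""
    n = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n, tokens_per_param * n

# Gopher (280B params, 300B tokens) and Chinchilla (70B, 1.4T)
# cost roughly the same compute; Chinchilla allocates it better.
gopher = train_flops(280e9, 300e9)
chinchilla = train_flops(70e9, 1.4e12)
```

Running the numbers shows the two models differ in compute by under 20%; the gain came from reallocating FLOPs toward data, not from spending more.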
Thesis
Scaling laws show smooth power-law improvement — no phase transitions, no sudden emergence of "understanding." Consistent with compression getting incrementally better, not qualitative capability jumps.
Mid-training begins
Phase 2 · Mid-training
Annealing & Cooldown
Final 1-5% of training tokens at decayed LR with curated, high-quality data. Crystallizes knowledge before post-training.
Llama 3: final 40M tokens at elevated quality. WSD schedule enables branching — checkpoint the stable phase, then anneal separately for different downstream objectives.

OLMo 2: explicit "annealing mix" with carefully balanced domain proportions. Math/code upweighted 2-3× vs pre-training mix.

Qwen 2.5: long-context extension folded into annealing phase — gradually increase context from 4K to 32K with RoPE rescaling.

Mechanism: decayed LR acts as implicit regularization — model consolidates patterns rather than learning new ones. Analogous to simulated annealing in optimization.
⚠ Critical
Data composition during annealing has outsized impact. Poor mix here can undo gains from trillions of pre-training tokens.
Domain Adaptation
Continued Pre-training
Additional pre-training on domain-specific corpora. Medicine, law, code, science. Same objective, targeted data.
CodeLlama: continued pre-training on 500B code tokens from Llama 2 base. Fills-in-the-middle (FIM) objective added: predict masked span given prefix and suffix.

BioMedLM / Med-PaLM: biomedical text continuation. Improves domain vocabulary coverage and factual recall in-domain.

Key insight: continued pre-training is cheaper than training from scratch. 50-100B domain tokens on a 7B model typically suffices for meaningful domain adaptation.

Risk: catastrophic forgetting of general capabilities. Mitigated by mixing 10-20% general data (replay buffer).
Thesis
Domain adaptation works because the base model already encodes general language patterns. Specialization is distribution shift, not knowledge creation.
Context Extension
Long-Context Training
RoPE rescaling, progressive extension from 4K to 128K+ tokens. Enables reasoning over long documents.
RoPE ABF (Llama 3): increase RoPE base frequency from 10K to 500K. Train on progressively longer sequences: 8K → 32K → 128K.

YaRN (Qwen): NTK-aware interpolation for RoPE. Extends context without full retraining — only ~1B tokens of long-context data needed.

MiniMax M1: native 1M context via O(n) Lightning Attention. Linear attention eliminates quadratic cost entirely.

Typical recipe: 2-stage approach. Stage 1: 100B tokens at extended context (mostly short, some long). Stage 2: fine-tune on long-context tasks specifically.
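The effect of the base change is visible numerically; a minimal sketch of RoPE's frequency spectrum (head_dim and base values are illustrative):

```python
def rope_inv_freq(head_dim, base=10000.0):
    """Per-pair rotation frequencies for RoPE: base^(-2i/d)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Adjusted base frequency (ABF): raising the base from 10K to 500K
# slows every rotation except the first, stretching each wavelength
# so positions far beyond the original context remain distinguishable.
short_ctx = rope_inv_freq(128, base=10_000.0)
long_ctx = rope_inv_freq(128, base=500_000.0)
```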
Post-training begins
Phase 3 · Stage 1
Supervised Fine-Tuning (SFT)
Distribution shift toward instruction-following. Teaches format and prior — not world knowledge.
p_θ(y|x) → p_θ(y|x, instruction prior)
• Data increasingly synthetic (persona-driven, task-specific, decontaminated)
Tülu 3: 939k prompts (57% public, 43% synthetic); Llama 4: prunes >50% "easy" data
DeepSeek-V3: 1.5M instances — reasoning distilled from R1 + human-verified
• PEFT: LoRA / DoRA at r=8–16 implies low intrinsic dimension of behavioral updates
⚠ Critical
LIMA "style steering" hypothesis holds for generic instruction-following. Breaks for reasoning scaffolds (cold-start CoT SFT is structurally necessary for RLVR stability) and genuine domain gaps.
Pipeline diverges here
Standard Alignment
Stage 2 · 2023+
Rejection Sampling
Best-of-N filtering via RM or verifier. Improves data quality without full RL.
Tülu 3: explicit stage; on-policy generations vs other models
DeepSeek-V3: RS on R1 rollouts — concise, formatted, verified
Kimi k1.5: shortest rejection sampling for long2short transfer
⚠ Critical
Selecting only correct outputs biases toward "easy" correct responses.
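The selection step itself is simple; a toy sketch with a stand-in verifier (the verifier and example strings here are hypothetical):

```python
def rejection_sample(candidates, verify, prefer_short=True):
    """Best-of-N filtering: keep verified-correct samples, then pick one.

    Shortest-first selection mimics Kimi's long2short transfer;
    verify is any callable returning True for acceptable outputs.
    """
    correct = [c for c in candidates if verify(c)]
    if not correct:
        return None  # no usable sample for this prompt
    return min(correct, key=len) if prefer_short else correct[0]

# Hypothetical verifier: accept answers ending in the right number.
verify = lambda s: s.strip().endswith("42")
picked = rejection_sample(
    ["after much deliberation, 42", "it is 42", "maybe 41"], verify)
# picked == "it is 42"
```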
Stage 3 · Preference Alignment
Path A
2022–present
RLHF / GRPO
Reward model + PPO/GRPO. KL-regularized RL.
max_π E[r(x,y)] − β·D_KL(π ‖ π_SFT)
GRPO: no value fn, group-normalize rewards:
A_i = (r_i − mean(r)) / std(r)
CISPO (MiniMax M1): clips IS weights; all tokens in gradient — more efficient than GRPO/DAPO.
or
Path B
2023–dominant
DPO / IPO
Contrastive log-likelihood. No RM needed.
log σ(β(log π_θ(y_w|x) − log π_θ(y_l|x)))
Tülu 3: length-normalized DPO
Llama 3.1/4, OpenAI: primary preference method
⚠ Critical
Static offline data — no exploration. Ceiling lower than online RL for reasoning.
or
Path C
2022–present
RLAIF / CAI
AI judge + constitution. Self-distillation with normative constraints.
Pioneered by Anthropic Constitutional AI (2022).

Claude (2026): the constitution explains why each principle holds, enabling generalization to novel cases.
Open question: converges to human alignment or recursive self-consistency?
⚠ Critical
Amplifies existing model biases at scale. Constitution encodes authors' assumptions.
Stage 4 · 2024+
Iterative Refinement
Model-generated synthetic data each round. SFT and RL under unified objective.
Tülu 3: eval-driven iteration — benchmark suite identifies skill gaps
DeepSeek-V3: R1 distillation → V3 SFT → V3 alignment
Kimi: curriculum + prioritized sampling (hard/underperformed problems)
SFT = offline exploitation (low variance, high bias)
RL = online exploration (high variance, low bias)
⚠ Critical
No convergence guarantee. Systematic errors entrench rather than self-correct.
iterative loop
Reasoning (LRM)
Step 1
SFT on Long CoT
Cold-start: curated long CoT traces. Teaches <think> scaffold.
R1-Zero skips this entirely — direct GRPO on verifiable problems. Full R1 uses limited human-aligned CoT cold-start first.
Step 2 · 2024–2025
RLVR / Pure RL
Verifiable outcome rewards. No RM. GRPO/CISPO. Emergent self-reflection.
DeepSeek-R1-Zero: 80k verifiable problems, 16 samples/group. Emergent: self-reflection, strategy switching, 5–7× longer CoT.

MiniMax M1 (CISPO): clips IS weights, all tokens in gradient. 512 H800s, 3 weeks, ~$534k. AIME 68%→80%.

OLMo 3 RL Zero: per-domain checkpoints (math/code/IFeval) — open resource for contamination research in RLVR.

Tülu 3 RLVR: deterministic verifiers. +1.7 MATH, +3.3 GSM8K from DPO baseline.
Step 3
RS + SFT (stabilize)
Filter best RLVR rollouts, mix with general data, crystallize gains.
• Keep: correct + readable + well-formatted
• Mix with general instruction data (prevents regression)
• ~2 epochs — distills RL-discovered reasoning into stable behavior
Step 4
RLHF / RFT Polish
Final alignment pass. Process supervision on CoT steps (OpenAI RFT).
R1 full: second RL stage with preference RM + rule rewards (language consistency)
OpenAI RFT: expert grader scores reasoning process step-by-step.
Stage 5 · 2024–2026
Specialized Enhancements
Reasoning depth, factuality, multimodal, agentic, safety, efficiency.
Reasoning (LRM)
Process supervision, exploratory RL, self-verification
Factuality
FLAME: factuality-aware SFT + DPO (Meta)
Agentic / long-context
Tool-use synthesis, joint environment RL (Kimi K2, GLM-4.5, MiniMax M2.5)
Multimodal
Zero-vision SFT + joint text-vision RL (Kimi K2.5, Llama 4)
Safety
Red-team preference passes · targeted DPO
Efficiency
Quantization · distillation · PEFT · linear attention (MiniMax)
Output
Production-Ready Aligned Model
Helpful · Honest · Harmless · Instruction-following · Reasoning-capable · Agentic
Major Labs
Meta (Llama 3.1 / 4)
Pre: 15T tokens, 4D parallelism on 16,384 H100s. Post: iterative SFT + RS + DPO. Llama 4: SFT → Online RL (hard prompts) → DPO polish.
Anthropic (Claude)
Constitution-driven synthetic data + RLAIF. 2026: constitution explains why, enabling generalization across novel cases.
OpenAI
SFT → DPO for standard models. RFT for reasoning: expert grader scores CoT process step-by-step. o1/o3: test-time compute scaling.
Google DeepMind
Gemini: TPU training at scale. PPO-based alignment. Gemini 1.5/2 adds long-context alignment passes.
Microsoft (Phi-3 / Phi-4)
"Textbook-quality" synthetic SFT. Orca/Orca-2: process supervision for step-level reasoning. Frontier reasoning at 3B–14B.
xAI (Grok-3)
DeepThink: GRPO-style reasoning RL. Standard SFT + DPO for base.
NVIDIA (Nemotron-4)
HelpSteer2 synthetic preference pipeline + open reward model adopted by other labs.
MiniMax (M1 / M2.5)
M1: CISPO on hybrid MoE + Lightning Attn (456B/45.9B, 1M ctx).
M2.5: 80.2% SWE-Bench, 76.3% BrowseComp. RL across 100k+ environments.
Open-weights labs
AllenAI (Tülu 3 / OLMo 3) ☆ fully open
Weights + data + code + eval. Tülu 3: SFT → DPO → RLVR. 70B matches GPT-4o-mini. OLMo 3: Dolci suite, RL Zero, OLMoTrace. Best fully open 32B thinking model at release.
DeepSeek (V3 / R1) open weights
V3: fp8 training, $5.58M total. SFT 1.5M (R1-distilled) + GRPO. R1-Zero: pure GRPO, no SFT. 79.8% AIME 2024, 97.3% MATH-500.
Moonshot AI (Kimi k1.5 / K2 / K2.5)
k1.5: online mirror-descent RL (length penalty, curriculum). K2: Muon optimizer, agentic synthesis. K2.5: joint text-vision RL.
Zhipu AI (GLM-4.5 / GLM-5)
Slime async RL framework. APRIL: agentic RL. 744B/40B active. Huawei Ascend — zero NVIDIA dependency.
Alibaba / Qwen
GRPO for math. QwQ-32B: RLVR extended CoT. SFT → GRPO → DPO.
ByteDance (Seed / Doubao)
Seed-Thinking v1.5: GRPO on verifiable problems. Deployment scale makes efficiency-constrained post-training practically significant.
Trends 2024–2026
Verifiable / rule-based RL surge — RLVR, GRPO, CISPO: emergent long-CoT without human trajectories.
Synthetic + on-policy dominance — distillation loops, persona generation, agentic synthesis.
DPO over PPO — stable and scalable; PPO/GRPO reserved for reasoning or online exploration.
Linear attention at frontier scale — MiniMax M1/M2.5: O(n) Lightning Attention now competitive.
Agentic focus — long-context RL, tool-use, joint multimodal RL.
Full openness raises the floor — Tülu 3, OLMo 3, Nemotron-4 as fully open blueprints.
§1 Optimizer Evolution

The path to AdamW

Loshchilov & Hutter (2019) — Decoupled Weight Decay Regularization
SGD: θ = θ − η∇L. Simple but requires careful tuning and struggles with saddle points. Momentum adds exponential moving average of gradients.

Adam: adaptive learning rates per-parameter via first (mean) and second (variance) moment estimates. Converges faster but L2 regularization interacts badly with adaptive rates.

AdamW: decouples weight decay from gradient update. Standard since GPT-2.
θ_{t+1} = θ_t − η(m̂_t / (√v̂_t + ε) + λθ_t)
Key: λθ_t applied directly, not scaled by adaptive rate
Typical hyperparameters: β_1 = 0.9, β_2 = 0.95, ε = 10^−8, weight decay λ = 0.1. Llama 3, GPT-4, Gemini all use AdamW with minor variations.
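The decoupling is clearest in code; a single-scalar sketch of one AdamW step (real implementations operate on whole tensors):

```python
import math

def adamw_step(theta, grad, state, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a single scalar parameter.

    state holds (m, v, t). The decay term weight_decay * theta is
    applied directly, NOT scaled by the adaptive term -- the
    decoupling that distinguishes AdamW from Adam + L2.
    """
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps)
                          + weight_decay * theta)
    return theta, (m, v, t)

theta, state = adamw_step(1.0, 0.5, (0.0, 0.0, 0))
```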

Next generation: Muon & SOAP

Kimi K2 (2025) — Muon optimizer for MoE pre-training
Muon (Kimi K2): applies Newton-Schulz orthogonalization to the momentum matrix. Claims 2× compute efficiency vs AdamW for MoE architectures. Orthogonalization prevents gradient collapse in expert routing.

SOAP: Shampoo-like second-order approximation. Maintains per-layer preconditioners. Higher memory cost but better conditioning on very large models. Promising for models >100B parameters.

Schedule-free optimizers: recent work removes the need for explicit LR schedules by using Polyak-style averaging. Not yet proven at frontier scale.

Practical impact

Optimizer | Memory per param | Used by | Status
SGD+M | 1 state | RL fine-tuning | Niche
AdamW | 2 states (m, v) | GPT-4, Llama 3, Gemini | Dominant
Muon | 1 state + orthogonalization | Kimi K2 | Emerging
SOAP | 2 states + preconditioners | Research | Experimental
§2 Learning Rate Schedules

Why schedules matter

LR is the single most impactful hyperparameter. Too high: loss spikes, training instability, potential divergence. Too low: slow convergence, wasted compute. The schedule shapes the entire loss trajectory.

Common schedules

Cosine: η(t) = η_min + 0.5(η_max − η_min)(1 + cos(π t/T))
Linear warmup: η(t) = η_max × t/T_warmup, for t < T_warmup
Cosine decay (GPT-3, GPT-4): smooth annealing from peak to ~10% of peak. Standard choice. Commits to a fixed training budget at start.

WSD (Warmup-Stable-Decay): three phases: (1) warmup to peak, (2) hold constant for majority of training, (3) sharp cosine/linear decay in final 10-20%. Key advantage: can branch checkpoints from the stable phase for different downstream objectives. Used by Llama 3, MiniMax, OLMo 2.

Inverse square root: η(t) = η_max / √t. Unbounded training — no need to set total steps. Used by some older models (original Transformer).

Practical values: peak LR scales with batch size. Llama 3 8B: 3×10^−4. Llama 3 405B: 8×10^−5. Warmup: 2000 steps typical. GPT-4 reportedly uses ~6000 warmup steps.
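Both dominant schedules are short functions; a sketch with illustrative defaults (peak LR, warmup steps, and decay fraction vary by model):

```python
import math

def warmup_cosine(step, total, warmup=2000, peak=3e-4, floor_frac=0.1):
    """Linear warmup to peak, then cosine decay to floor_frac * peak."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    floor = floor_frac * peak
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

def wsd(step, total, warmup=2000, peak=3e-4, decay_frac=0.15):
    """Warmup-Stable-Decay: hold at peak, then sharp linear decay.

    Checkpoints from the stable phase can be branched and annealed
    separately for different downstream objectives.
    """
    decay_start = int(total * (1 - decay_frac))
    if step < warmup:
        return peak * step / warmup
    if step < decay_start:
        return peak
    return peak * (total - step) / (total - decay_start)

# At 80% through training, cosine has long since decayed;
# WSD is still at peak LR.
```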
§3 Scaling Laws & Compute Allocation
Kaplan et al. (2020) — Scaling Laws for Neural Language Models
Hoffmann et al. (2022) — Chinchilla: Training Compute-Optimal LLMs

Kaplan (2020) — first scaling laws

L(N) ≈ (N_c / N)^0.076 — loss vs parameters
L(D) ≈ (D_c / D)^0.095 — loss vs data
L(C) ≈ (C_c / C)^0.050 — loss vs compute
Key insight: smooth power laws with no discontinuities. Suggested scaling parameters faster than data (allocate more compute to larger models).

Chinchilla (2022) — corrected scaling

Kaplan underestimated the importance of data. Chinchilla showed optimal scaling is roughly D ≈ 20N (20 tokens per parameter for compute-optimal training). A 70B model trained on 1.4T tokens outperformed the 280B Gopher (300B tokens) at the same compute budget.

Impact: shifted the field from "bigger models" to "more data." Llama, Mistral, and Gemma all follow Chinchilla-informed ratios.

Beyond Chinchilla: overtraining

Modern practice deliberately overtrains relative to Chinchilla-optimal:
Model | Params | Tokens | Chinchilla ratio
Chinchilla | 70B | 1.4T | 1× (optimal)
Llama 2 7B | 7B | 2T | ~14×
Llama 3 8B | 8B | 15T | ~94×
SmolLM3 3B | 3B | 11T | ~183×

Rationale: training cost is paid once; inference cost is paid billions of times. A smaller, overtrained model is cheaper to deploy than a larger, compute-optimal one.

Thesis relevance

Scaling laws predict smooth, continuous improvement — no phase transitions. "Emergent abilities" may be artifacts of nonlinear evaluation metrics applied to smooth underlying capability curves (Schaeffer et al., 2023).
§4 Precision & Numerical Formats
Micikevicius et al. (2018) — Mixed Precision Training
DeepSeek-AI (2024) — DeepSeek-V3: fp8 Training at Scale

The precision hierarchy

fp32 (8 exp, 23 mant) → bf16 (8 exp, 7 mant) → fp8 E4M3 (4 exp, 3 mant)
Memory: 4 bytes → 2 bytes → 1 byte per element
FLOPS: 1× → ~2× → ~4× throughput (on matching hardware)
Mixed precision recipe:
1. Store master weights in fp32
2. Cast to bf16 for forward/backward pass
3. Compute gradients in bf16
4. Accumulate and apply in fp32

Why bf16 over fp16: bf16 has the same 8-bit exponent as fp32, so it covers the same dynamic range. fp16 has only 5 exponent bits — needs loss scaling to avoid underflow in gradients. bf16 "just works" on modern hardware (A100+).
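The trade-off can be reproduced with plain bit manipulation; a sketch that emulates bf16 rounding of an fp32 value in software:

```python
import struct

def to_bf16(x):
    """Round an fp32 value to bfloat16 by keeping the top 16 bits of
    its IEEE-754 encoding (round-to-nearest), decoded back to float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    round_bit = (bits >> 16) & 1
    bits = (bits + 0x7FFF + round_bit) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Same dynamic range as fp32: 1e38 survives, where fp16 (max ±65504)
# would overflow -- but only ~3 significant decimal digits remain.
big = to_bf16(1.0e38)     # finite, close to 1e38
pi = to_bf16(3.14159)     # 3.140625 -- the nearest bf16 value
```

bf16 keeps fp32's 8-bit exponent and sacrifices mantissa bits, which is exactly why no loss scaling is needed.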

fp8 frontier (DeepSeek V3): fine-grained quantization with per-tile scaling factors. Each 128-element tile gets its own scale. Result: full fp8 training with no measurable quality loss. 2.788M H800 GPU-hours, $5.58M total cost.

Inference quantization

Post-training quantization pushes further: INT4/INT8 (GPTQ, AWQ), even INT2 (QuIP#). Inference-only — lower fidelity acceptable when not training. Suggests representations are robust to significant precision reduction.
§5 Regularization & Stability

Weight decay

Applied to all non-bias, non-normalization parameters. Typical: λ = 0.1. In AdamW, decay is decoupled: θ = θ − ηλθ, not folded into gradient. Prevents weight magnitude explosion over long training runs.

Dropout

Surprisingly, most frontier LLMs use zero dropout during pre-training (Llama 3, GPT-4, DeepSeek V3). The massive dataset size provides sufficient regularization. Dropout is re-introduced during fine-tuning (LoRA dropout = 0.05-0.1).

Gradient clipping

g = g × min(1, max_norm / ||g||)
Typical max_norm = 1.0
Prevents loss spikes from corrupting training. Critical for stability at scale. Llama 3 reports occasional loss spikes — resolved by rewinding to earlier checkpoint and skipping problematic data batches.
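A sketch of global-norm clipping over a flat gradient vector:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale the whole gradient vector so its L2 norm is at most
    max_norm. Direction is preserved; only magnitude is capped."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads], norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
# original norm was 5.0; clipped gradient has unit norm
```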

QK-Norm

Dehghani et al. (2023) — Scaling ViTs; adopted by Gemma 2, Cohere, OLMo 2
Applies RMSNorm to query and key vectors before attention computation. Prevents attention logit growth that causes training instability at scale. Increasingly standard — used by Gemma 2, Gemma 3, OLMo 2, SmolLM3.
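A sketch of the normalization itself (learned gains and per-head reshaping omitted):

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: rescale x to unit root-mean-square, then apply a
    learned per-dimension gain (identity here for clarity)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

# QK-Norm applies this to each query and key head before the dot
# product, so attention logits q·k stay bounded as activations grow.
q = rms_norm([30.0, 40.0])
```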

Z-loss

Auxiliary loss that penalizes large logits in the output layer. Prevents representational collapse. Used by PaLM, Gemini. Stabilizes training without constraining model capacity.
§6 Post-Training Optimization

SFT mechanics

Standard cross-entropy on curated instruction-response pairs. Key difference from pre-training: loss is computed only on response tokens; instruction/prompt tokens are masked out of the loss (label masking, distinct from the causal attention mask).

LR: ~10× lower than pre-training peak (e.g., 2×10^−5 for 7B model). 1-3 epochs typical. Overfitting is the primary risk.
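The masking is the whole trick; a pure-Python sketch of response-only loss:

```python
def sft_loss(token_logps, response_mask):
    """Mean negative log-likelihood over response tokens only.

    token_logps: log p(token) for every position in the sequence;
    response_mask: 1 for response tokens, 0 for prompt tokens.
    """
    masked = [-lp for lp, m in zip(token_logps, response_mask) if m]
    return sum(masked) / len(masked)

# Prompt tokens (mask 0) contribute nothing to the loss, no matter
# how poorly the model predicts them.
loss = sft_loss([-5.0, -5.0, -0.1, -0.2], [0, 0, 1, 1])
# loss ≈ 0.15: only the two response tokens count
```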

RL mechanics: GRPO

DeepSeek-AI (2025) — DeepSeek-R1; Shao et al. (2024) — DeepSeekMath
J(θ) = E_q[Σ_g min(r_g × A_g, clip(r_g, 1−ε, 1+ε) × A_g) − β D_KL(π_θ || π_ref)]
where r_g = π_θ(o_g|q) / π_old(o_g|q), A_g = (R_g − mean) / std
Group of G samples per prompt. No value network (unlike PPO). Group-level normalization of advantages provides baseline. KL penalty prevents drift from reference policy. Clip ratio ε = 0.2 typical.
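The group-normalized advantage takes a few lines; a sketch (the clipped surrogate and KL term are omitted):

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Each sample's reward relative to its group, in units of the
    group's standard deviation. This group baseline replaces PPO's
    learned value network."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

# Binary verifier rewards for one prompt, G = 4 samples:
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# correct samples get ~+1, incorrect ~-1; advantages sum to zero
```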

RL mechanics: DPO

Rafailov et al. (2023) — Direct Preference Optimization
L_DPO = −log σ(β(log π_θ(y_w|x)/π_ref(y_w|x) − log π_θ(y_l|x)/π_ref(y_l|x)))
Closed-form solution to the RLHF objective under Bradley-Terry preference model. No reward model needed. Offline: uses static preference pairs.

Limitation: no online exploration. Can only optimize within the support of the preference dataset. Ceiling on complex reasoning tasks where the model needs to discover novel solutions.
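Given sequence log-probs, the loss is nearly a one-liner; a sketch for a single preference pair:

```python
import math

def dpo_loss(pi_w, ref_w, pi_l, ref_l, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probs.

    pi_*  : log-prob under the policy being trained
    ref_* : log-prob under the frozen reference policy
    w / l : chosen (winner) vs rejected (loser) response
    """
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid

# With no separation from the reference, loss starts at log 2. As the
# policy raises the chosen response and/or lowers the rejected one
# relative to the reference, the margin grows and the loss falls.
start = dpo_loss(-10.0, -10.0, -12.0, -12.0)   # margin 0
later = dpo_loss(-8.0, -10.0, -14.0, -12.0)    # margin 0.4
```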

Structural summary

Method | Math form | Function | Limitation
SFT | MLE on curated data | Instruction prior | Bounded by demo quality
RLHF (PPO) | max E[r] − β·D_KL | Online preference shaping | Expensive; RM hacking
DPO | Contrastive log-ratio | Efficient preferences | Static; no exploration
GRPO | Group-normalized advantage | RL without value fn | Discards clipped tokens
CISPO | Clip IS; all tokens | Efficient RL | Needs reliable reward
RLVR | Binary verifier ∈ {0,1} | Emergent reasoning | Verifiable domains only
§1 Does Scale Create Understanding?

The central question

Pre-training compresses trillions of tokens into model parameters. Scaling laws show smooth power-law improvement. The thesis asks: does this compression produce genuine understanding, or just increasingly effective pattern matching?

Evidence for "just compression":
• Scaling laws show no phase transitions — smooth curves, no sudden jumps
• Schaeffer et al. (2023): "emergent abilities" are artifacts of nonlinear metrics
• Models fail on trivially modified versions of training-distribution problems
• fp8 training with no quality loss suggests representations are approximate

Evidence for "something more":
• In-context learning emerges without explicit training for it
• Compositional generalization improves with scale (on some benchmarks)
• Internal representations encode linear probes for truth, spatial relations
• RLVR produces emergent self-reflection behaviors not in training data
§2 Pre-training: Compression or Cognition?

The compression hypothesis

Deletang et al. (2024) — Language Modeling Is Compression
Next-token prediction is mathematically equivalent to lossless compression: by Shannon's source coding theorem (1948), a model's predictive distribution can drive an arithmetic coder, and better prediction means shorter codes. A model that perfectly predicts the next token achieves optimal compression of the training distribution.

Implications: every "capability" is a compression artifact. The model doesn't "know" facts — it encodes statistical regularities that happen to correlate with factual knowledge. It doesn't "reason" — it compresses patterns that look like reasoning in training data.

The optimizer shapes the compression

AdamW with warmup + cosine decay creates a specific learning dynamic: early training captures broad distributional patterns (high LR, fast learning); late training refines fine-grained distinctions (low LR, slow consolidation). This is compression at different granularities, not a transition from "pattern matching" to "understanding."

Critical batch size

McCandlish et al. (2018) — An Empirical Model of Large-Batch Training
Below critical batch size: gradient noise provides useful regularization. Above it: diminishing returns. The existence of a critical batch size suggests the optimization landscape has a characteristic scale — consistent with compression, not with models discovering abstract principles.
§3 Mid-training: The Overlooked Phase

Why mid-training matters for the thesis

The annealing phase reveals what the model has actually learned vs. what it can be steered toward. If annealing on math data improves math performance disproportionately, the base model had latent math capability that was merely being surfaced — not created.

WSD schedule evidence: Llama 3 uses WSD, enabling multiple anneal branches from the same stable-phase checkpoint. Different anneal mixes produce different downstream capabilities. This is consistent with the model as a compressed representation that can be selectively "unzipped" for different tasks.

Domain adaptation evidence: continued pre-training on 50-100B domain tokens can dramatically shift performance on domain tasks. The efficiency of this (50B tokens vs. 15T original) suggests the base model already encodes the relevant patterns in a compressed form — domain data merely amplifies them.

Long-context extension: RoPE rescaling works because positional encoding is a continuous function. The model generalizes to longer contexts because it learned local patterns that compose — not because it "understands" document structure.
§4 Post-training: Representation vs. Behavior

The wrapper hypothesis

Does post-training change cognition (internal representations) or only surface policy (output distribution)?

Mechanistic evidence: core world models are largely preserved through SFT and DPO. Alignment layers act as policy heads. LoRA's effectiveness at r=8-16 implies the behavioral update is low-rank — a thin veneer, not deep surgery.

Where the wrapper breaks

RLVR on reasoning appears to alter internal computational pathways. R1-Zero develops emergent self-verification and extended CoT that are not obvious pre-training artifacts. The model allocates more compute to harder problems (longer CoT) — this adaptive resource allocation is not trivially explained by surface pattern matching.

Counter-argument: RLVR may simply be selecting for pre-existing computational pathways that produce correct answers. The "emergent" behaviors were always latent in the base model — RLVR provides the selection pressure to surface them. This is optimization over existing repertoire, not creation of new capabilities.

The alignment tax question

Does preference optimization degrade reasoning? Evidence is mixed and confounded by benchmark saturation. DPO can collapse output diversity. RLHF can induce sycophancy. Both suggest post-training constrains rather than expands the model's effective capability space.
§5 Test-Time Compute & Training-Inference Boundary
Snell et al. (2024) — Scaling LLM Test-Time Compute Optimally
OpenAI (2024) — o1; DeepSeek-AI (2025) — R1; MiniMax (2025) — M1

The dissolving boundary

Post-training is no longer a discrete phase — it extends into inference-time optimization. o1/o3, R1, M1 all use extended inference-time computation to improve reasoning. The boundary between "what the model knows" and "what the model computes at inference" is blurring.

MiniMax M1: Lightning Attention makes test-time compute scaling economically viable — ~25% FLOPs of DeepSeek-R1 at 100K generation length. First linear-attention model competitive at frontier scale.

Thesis implications

If models can "think harder" at inference time and produce better answers, does this constitute reasoning? Or is it search over the model's existing pattern repertoire with more compute budget?

The chess analogy: AlphaGo's MCTS (Monte Carlo Tree Search) is clearly "search," not "understanding." LLM test-time compute may be the same — systematic exploration of the model's compressed representation space, not genuine deliberation.
§6 Open Research Problems
Scaling law phase transitions
Do scaling laws truly have no discontinuities, or do abrupt capabilities emerge at specific compute thresholds? Evidence increasingly favors smooth curves with nonlinear metrics creating the illusion of emergence.
Optimizer impact on representations
Does the choice of optimizer (AdamW vs. Muon vs. SOAP) affect what the model learns, or only how fast it learns? If different optimizers produce equivalent models, the representations are determined by data, not algorithm.
Annealing as capability selection
Does annealing on domain data create new capabilities or surface existing ones? WSD branching experiments could directly test this.
Precision lower bound
What is the minimum precision at which training quality degrades? If models train well in fp4, the representations must be inherently low-precision — more consistent with heuristic matching than exact computation.
RLVR: selection or creation?
Does RLVR create new computational pathways or select from pre-existing ones? Mechanistic interpretability studies on R1-Zero could resolve this.
Alignment tax
Does preference optimization degrade reasoning? Evidence mixed; confounded by benchmark saturation. Need controlled studies with held-out evaluations.
Post-training scaling laws
Current scaling laws model pre-training only. No equivalent framework for predicting post-training gains from compute investment. Active research area.
Data composition sensitivity
Why does annealing data composition have outsized impact? If 0.1% of training tokens (annealing phase) can shift capabilities dramatically, what does this say about the robustness of learned representations?
§7 Training Pipeline Summary
Phase | Objective | Duration | Thesis angle
Pre-training | Next-token prediction (cross-entropy) | Weeks–months, trillions of tokens | Compression; no phase transitions
Annealing | Consolidation on high-quality data | Final 1–5% of tokens | Surfaces latent capabilities
Domain adaptation | Distribution shift to target domain | 50–100B tokens | Amplifies existing patterns
Context extension | RoPE rescaling + long-context data | ~1B tokens | Composable local patterns
SFT | Instruction prior + format | 1–3 epochs, 100K–1.5M examples | Low-rank surface steering
RLHF / DPO | Preference alignment | Days | Policy head, not cognition
RLVR / GRPO | Verifiable reasoning RL | Days–weeks | Selection or creation? Open
Test-time compute | Inference-time search | Per-query | Search over repertoire
Sources
Stanford CS336 (Spring 2025) · Kaplan et al. (2020) · Hoffmann et al. (2022, Chinchilla) · Loshchilov & Hutter (2019, AdamW) · Micikevicius et al. (2018) · Ouyang et al. (2022, InstructGPT) · Rafailov et al. (2023, DPO) · DeepSeek-AI (2024/2025) · Team OLMo (2025) · MiniMax (2025) · Kimi (2025)