Weight decay
Applied to all non-bias, non-normalization parameters; typical λ = 0.1. In AdamW the decay is decoupled: the weights are shrunk directly, θ ← θ − ηλθ, rather than folding λθ into the gradient before the Adam update. Keeps weight magnitudes from growing without bound over long training runs.
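A minimal PyTorch sketch of this parameter grouping, assuming a generic `model`; the learning rate and betas are illustrative defaults, and tensors with fewer than two dimensions (biases, norm gains) are taken as the no-decay set:

```python
import torch

def build_adamw(model, lr=3e-4, weight_decay=0.1):
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        # 1-D tensors are biases and norm gains; keep them out of the decay group.
        (no_decay if p.ndim < 2 else decay).append(p)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    # AdamW applies the decay directly to the weights (decoupled),
    # not by adding lambda * theta to the gradient.
    return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95))
```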
Dropout
Surprisingly, most frontier LLMs use zero dropout during pre-training (Llama 3, GPT-4, DeepSeek V3): the massive dataset size provides sufficient regularization on its own. Dropout is re-introduced during fine-tuning, where the much smaller dataset makes overfitting a real risk (typical LoRA dropout of 0.05-0.1).
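As a concrete example of the fine-tuning side, a sketch using Hugging Face PEFT's `LoraConfig`; the rank, alpha, and target modules here are illustrative choices, only the `lora_dropout` value reflects the range above:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                 # illustrative rank
    lora_alpha=32,                        # illustrative scaling
    lora_dropout=0.05,                    # dropout re-introduced for fine-tuning
    target_modules=["q_proj", "v_proj"],  # illustrative target projections
    task_type="CAUSAL_LM",
)
```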
Gradient clipping
g = g × min(1, max_norm / ||g||)
Typical max_norm = 1.0
Prevents rare, outsized gradients and the resulting loss spikes from corrupting training. Critical for stability at scale. Llama 3 still reports occasional loss spikes, resolved by rewinding to an earlier checkpoint and skipping the problematic data batches.
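A sketch of where clipping sits in an otherwise standard PyTorch training step; `model`, `optimizer`, `batch`, and `targets` are assumed, and the cross-entropy loss is illustrative:

```python
import torch

def training_step(model, optimizer, batch, targets, max_norm=1.0):
    loss = torch.nn.functional.cross_entropy(model(batch), targets)
    loss.backward()
    # Rescale the whole gradient vector if its global L2 norm exceeds max_norm:
    # g <- g * min(1, max_norm / ||g||)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```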
QK-Norm
Dehghani et al. (2023), "Scaling Vision Transformers to 22 Billion Parameters"
Applies RMSNorm to the query and key vectors before the attention computation, preventing the attention-logit growth that causes training instability at scale. Increasingly standard: adopted by Gemma 2, Gemma 3, OLMo 2, SmolLM3, and Cohere's models.
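A minimal sketch of QK-Norm inside one attention block, assuming recent PyTorch (`nn.RMSNorm`, `scaled_dot_product_attention`); the dimensions and causal masking are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # RMSNorm applied per head to queries and keys before attention.
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.n_heads, self.head_dim)
        # Normalize q and k over the head dimension, then move heads forward.
        q = self.q_norm(q.view(shape)).transpose(1, 2)
        k = self.k_norm(k.view(shape)).transpose(1, 2)
        v = v.view(shape).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, -1))
```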
Z-loss
Auxiliary loss that penalizes large logits in the output layer by keeping the softmax normalizer log Z close to 0 (PaLM uses z-loss = 10⁻⁴ · log² Z). Prevents the logit drift that destabilizes long runs. Used by PaLM, Gemini. Stabilizes training without constraining model capacity.
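A sketch of the auxiliary term added to the usual cross-entropy, using the 10⁻⁴ coefficient reported for PaLM; `logits` of shape (batch, vocab) and integer `targets` are assumed:

```python
import torch
import torch.nn.functional as F

def loss_with_z_loss(logits, targets, z_coeff=1e-4):
    # Standard next-token cross-entropy.
    ce = F.cross_entropy(logits, targets)
    # log Z = logsumexp over the vocabulary; penalizing its square keeps
    # the softmax normalizer close to 1 (log Z close to 0).
    log_z = torch.logsumexp(logits, dim=-1)
    return ce + z_coeff * (log_z ** 2).mean()
```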