
LLM Architecture Evolution 2017–

Activations introduce non-linearity into the FFN. The evolution moves from hard gates (ReLU) to smooth approximations (GELU) to gated linear units (the GLU family), where a sigmoid or swish gate multiplies the linear branch elementwise, giving the network multiplicative control over information flow.
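The progression above can be sketched in a few lines. This is a minimal numpy illustration, not any particular model's implementation: `gelu` uses the common tanh approximation, `swish` is SiLU with beta = 1, and `swiglu_ffn` shows the GLU pattern in which the swish-activated gate branch multiplies the linear "up" branch before the down-projection. The weight names `W_gate`, `W_up`, `W_down` are illustrative.

```python
import numpy as np

def relu(x):
    # Hard gate: zero below the threshold, identity above it.
    return np.maximum(x, 0.0)

def gelu(x):
    # Smooth approximation: tanh-based GELU approximation.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    # SiLU / Swish with beta = 1: x * sigmoid(x).
    return x / (1 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # GLU family (SwiGLU): the swish-activated gate multiplies the
    # linear "up" branch elementwise -- multiplicative control over flow.
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down
```

Replacing `swish` with a plain sigmoid recovers the original GLU; with GELU it becomes GEGLU.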
Normalisation — Pre / Post / Skip / RMS
Where the norm sits inside the residual block determines training stability, gradient flow, and depth scalability. The trend: Post-LN → Pre-LN → Pre-RMSNorm → QK-Norm on top.
Post-LN (original transformer): norm applied after the residual addition; gradients can vanish in early layers at depth.
Pre-LN: norm applied before the sub-layer; stable gradients, dominant since GPT-2.
Skip-norm / no-norm: some hybrid linear-attention layers omit the norm on the residual path entirely.
Canonical Transformer Block — invariant structure since Vaswani et al. 2017
Structure is invariant: embed+pos → (norm → attn → add → norm → FFN/MoE → add) × N → norm → logits.
What evolves: positional scheme · attention variant · FFN gating/MoE · norm placement.
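The invariant structure reads directly as a loop. A hedged sketch in numpy, assuming the Pre-LN placement that has been dominant since GPT-2; `attn` and `ffn` are stand-ins for whatever attention variant and FFN/MoE a given model plugs in, and `rmsnorm` stands in for the model's norm of choice:

```python
import numpy as np

def rmsnorm(x, g, eps=1e-6):
    # RMS rescaling over the feature axis with learnable gain g.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * g

def transformer_logits(tokens, embed, pos, layers, g_final, unembed):
    # embed + positional encoding
    x = embed[tokens] + pos[: len(tokens)]
    # (norm -> attn -> add -> norm -> FFN -> add) x N
    for attn, ffn, g1, g2 in layers:
        x = x + attn(rmsnorm(x, g1))
        x = x + ffn(rmsnorm(x, g2))
    # final norm -> logits
    return rmsnorm(x, g_final) @ unembed
```

Everything the section lists as "what evolves" is confined to the callables and parameters passed in; the skeleton itself is unchanged since 2017.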
Source: Raschka (2025) — The Big LLM Architecture Comparison · magazine.sebastianraschka.com
[Interactive comparison table not recovered — columns: Model · Year · Lab · Pos. Encoding · Attention · Activation / FFN · Norm · MoE · Notes. Legend marker = underrepresented in post-training literature]
Sources
Sebastian Raschka (2025) — The Big LLM Architecture Comparison
Stanford CS336 — Language Modeling from Scratch (Spring 2025)