Activations introduce non-linearity into the
FFN. The evolution moves from hard gates (ReLU) to smooth approximations (GELU) to
gated linear units (the GLU family), where a sigmoid or swish gate
multiplies the linear branch, giving the network multiplicative control over information
flow.
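A minimal sketch of the gated-FFN idea (SwiGLU, the swish-gated member of the GLU family). The function names and toy dimensions are illustrative, not from any specific model:

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x) -- the smooth gate used in SwiGLU
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated FFN (SwiGLU): the swish-gated branch multiplies the
    linear branch element-wise, then projects back to d_model."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Toy dimensions (hypothetical): d_model=4, d_ff=8
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))
w_gate = rng.standard_normal((4, 8))
w_up = rng.standard_normal((4, 8))
w_down = rng.standard_normal((8, 4))
print(swiglu_ffn(x, w_gate, w_up, w_down).shape)  # (2, 4)
```

Note the multiplicative control: wherever the gate branch saturates near zero, the corresponding channel of the linear branch is suppressed regardless of its value.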
Normalisation — Pre / Post / Skip / RMS
Where the norm sits inside the residual block determines training stability, gradient
flow, and depth scalability. The trend:
Post-LN → Pre-LN → Pre-RMSNorm → QK-Norm on top.
Post-LN (original transformer): norm
applied after the residual addition — gradients can vanish in early layers at
depth. Pre-LN: norm applied before the sub-layer — stable
gradients, the dominant choice since GPT-2.
Skip-norm / No-norm: some hybrid
linear-attention layers omit norm entirely on the residual path.
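The two placements can be sketched side by side. This is a hedged sketch, not any model's implementation: RMSNorm is used for both placements for brevity, and the function names and `eps` default are assumptions:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square only -- no mean
    # subtraction and no bias, cheaper than full LayerNorm
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def pre_norm_block(x, sublayer, gain):
    # Pre-LN placement: normalise *before* the sub-layer, then add the
    # residual -- the residual path itself is never normalised
    return x + sublayer(rms_norm(x, gain))

def post_norm_block(x, sublayer, gain):
    # Post-LN placement (original transformer): normalise *after*
    # the residual addition
    return rms_norm(x + sublayer(x), gain)

d = 8
x = np.random.default_rng(1).standard_normal((2, d))
gain = np.ones(d)
identity = lambda h: h  # stand-in sub-layer
print(pre_norm_block(x, identity, gain).shape)  # (2, 8)
```

The gradient-flow difference is visible in the structure: in `pre_norm_block` the residual term `x` bypasses the norm entirely, so the identity path from logits back to the embeddings is unnormalised at every depth; in `post_norm_block` every layer's output passes through a norm.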
Canonical Transformer Block — invariant structure since Vaswani et al. 2017
Structure is invariant: embed+pos → (norm →
attn → add → norm → FFN/MoE → add) × N → norm → logits.
What evolves: positional scheme ·
attention variant ·
FFN gating/MoE ·
norm placement. Source: Raschka (2025) — The Big LLM Architecture Comparison ·
magazine.sebastianraschka.com
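The invariant layout above can be sketched as a short loop. All sub-layers here are stand-in random projections; the pre-norm placement, names, and dimensions are illustrative assumptions, not a specific model:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Plain LayerNorm: subtract the mean, divide by the std, per token
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_stack(x, attn, ffn, n_layers):
    # (norm -> attn -> add -> norm -> FFN -> add) x N -> final norm
    for _ in range(n_layers):
        x = x + attn(layer_norm(x))  # attention sub-block + residual
        x = x + ffn(layer_norm(x))   # FFN sub-block + residual
    return layer_norm(x)             # final norm before the logits projection

rng = np.random.default_rng(2)
d = 16
w_attn = rng.standard_normal((d, d)) / np.sqrt(d)  # attention stand-in
w_ffn = rng.standard_normal((d, d)) / np.sqrt(d)   # FFN stand-in
h = transformer_stack(rng.standard_normal((3, d)),
                      lambda z: z @ w_attn,
                      lambda z: z @ w_ffn,
                      n_layers=4)
print(h.shape)  # (3, 16)
```

Everything the section lists as "what evolves" slots into this skeleton: swap the `attn` callable, the `ffn` callable, the norm function, or the norm placement without touching the loop.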
Comparison table columns: Model · Year · Pos. Encoding · Attention · Activation / FFN · Norm · Notes
Architecture summary table columns: Model · Year · Lab · Activation · Attention · Pos. Embed · Norm · MoE. Legend: ● = underrepresented in post-training literature.
Sources: Sebastian Raschka (2025) — The Big LLM Architecture Comparison · Stanford CS336 — Language Modeling from Scratch (Spring 2025)