Activations introduce non-linearity into the
FFN. The evolution moves from hard gates (ReLU) to smooth approximations (GELU) to
gated linear units (the GLU family), where a sigmoid or swish gate
multiplies the linear branch, giving the network multiplicative control over information
flow.
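A minimal sketch of the gated-FFN idea (SwiGLU, the swish-gated member of the GLU family). The function names and toy dimensions are illustrative, not from any specific model:

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x) -- the smooth gate used in SwiGLU
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated FFN (SwiGLU): the swish-gated branch multiplies the
    linear branch element-wise, then projects back to d_model."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Toy dimensions (hypothetical): d_model=4, d_ff=8
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))
w_gate = rng.standard_normal((4, 8))
w_up = rng.standard_normal((4, 8))
w_down = rng.standard_normal((8, 4))
print(swiglu_ffn(x, w_gate, w_up, w_down).shape)  # (2, 4)
```

Note the multiplicative control: wherever the gate branch saturates near zero, the corresponding channel of the linear branch is suppressed regardless of its value.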
Normalisation — Pre / Post / Skip / RMS
Where the norm sits inside the residual block determines training stability, gradient
flow, and depth scalability. The trend:
Post-LN → Pre-LN → Pre-RMSNorm → QK-Norm on top.
Post-LN (original transformer): norm
applied after the residual addition — gradients can vanish in early layers at
depth. Pre-LN: norm applied before the sub-layer — stable
gradients, the dominant choice since GPT-2.
Skip-norm / No-norm: some hybrid
linear-attention layers omit norm entirely on the residual path.
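The two placements can be sketched side by side. This is a hedged sketch, not any model's implementation: RMSNorm is used for both placements for brevity, and the function names and `eps` default are assumptions:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square only -- no mean
    # subtraction and no bias, cheaper than full LayerNorm
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def pre_norm_block(x, sublayer, gain):
    # Pre-LN placement: normalise *before* the sub-layer, then add the
    # residual -- the residual path itself is never normalised
    return x + sublayer(rms_norm(x, gain))

def post_norm_block(x, sublayer, gain):
    # Post-LN placement (original transformer): normalise *after*
    # the residual addition
    return rms_norm(x + sublayer(x), gain)

d = 8
x = np.random.default_rng(1).standard_normal((2, d))
gain = np.ones(d)
identity = lambda h: h  # stand-in sub-layer
print(pre_norm_block(x, identity, gain).shape)  # (2, 8)
```

The gradient-flow difference is visible in the structure: in `pre_norm_block` the residual term `x` bypasses the norm entirely, so the identity path from logits back to the embeddings is unnormalised at every depth; in `post_norm_block` every layer's output passes through a norm.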
Canonical Transformer Block — invariant structure since Vaswani et al. 2017
Structure is invariant: embed+pos → (norm →
attn → add → norm → FFN/MoE → add) × N → norm → logits.
What evolves: positional scheme ·
attention variant ·
FFN gating/MoE ·
norm placement. Source: Raschka (2025) — The Big LLM Architecture Comparison ·
magazine.sebastianraschka.com
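The invariant layout above can be sketched as a short loop. All sub-layers here are stand-in random projections; the pre-norm placement, names, and dimensions are illustrative assumptions, not a specific model:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Plain LayerNorm: subtract the mean, divide by the std, per token
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_stack(x, attn, ffn, n_layers):
    # (norm -> attn -> add -> norm -> FFN -> add) x N -> final norm
    for _ in range(n_layers):
        x = x + attn(layer_norm(x))  # attention sub-block + residual
        x = x + ffn(layer_norm(x))   # FFN sub-block + residual
    return layer_norm(x)             # final norm before the logits projection

rng = np.random.default_rng(2)
d = 16
w_attn = rng.standard_normal((d, d)) / np.sqrt(d)  # attention stand-in
w_ffn = rng.standard_normal((d, d)) / np.sqrt(d)   # FFN stand-in
h = transformer_stack(rng.standard_normal((3, d)),
                      lambda z: z @ w_attn,
                      lambda z: z @ w_ffn,
                      n_layers=4)
print(h.shape)  # (3, 16)
```

Everything the section lists as "what evolves" slots into this skeleton: swap the `attn` callable, the `ffn` callable, the norm function, or the norm placement without touching the loop.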
Comparison table columns: Model · Year · Pos. Encoding · Attention · Activation / FFN · Norm · Notes
Architecture summary table columns: Model · Year · Lab · Activation · Attention · Pos. Embed · Norm · MoE. Legend: ● = underrepresented in post-training literature.
Sources: Sebastian Raschka (2025) — The Big LLM Architecture Comparison · Stanford CS336 — Language Modeling from Scratch (Spring 2025)