About This Visualization
This interactive diagram illustrates how multi-head attention transforms a polysemous word
("Amazon") based on surrounding context. The visualization shows:
- Stage 0: Without context, "Amazon" sits equidistant from the geographic and corporate regions of embedding space.
- Stage 1–2: Three attention heads (syntactic, semantic, entity) attend differently to the sentence. Their outputs are concatenated and projected via W_O into the final embedding h (see the sketch after this list).
- Task heads: The same h can be projected through different learned matrices (W_lm for next-token prediction, W_cls for classification) to produce different output distributions (see the task-head sketch below the caveat).
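To make the Stage 1–2 mechanics concrete, here is a minimal NumPy sketch of the same pipeline: each head computes softmax(Q·K^T / √d)·V, the head outputs are concatenated, and W_O projects the result into the final embedding h. All sizes, the random weights, and the three-head split are illustrative assumptions for this sketch, not values taken from the visualization or any trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

seq_len, d_model, n_heads = 6, 12, 3        # e.g. the tokens of a short "Amazon ..." sentence; sizes are placeholders
d_head = d_model // n_heads

x = rng.normal(size=(seq_len, d_model))     # input embeddings, one row per token

head_outputs = []
for _ in range(n_heads):                    # stand-ins for the "syntactic", "semantic", "entity" heads
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_head))

    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(d_head)      # Q·K^T, scaled
    attn = softmax(scores, axis=-1)         # each row: how much this token attends to every token
    head_outputs.append(attn @ V)           # weighted sum of value vectors

concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, d_model): head outputs side by side
W_O = rng.normal(size=(d_model, d_model))
h_final = concat @ W_O                           # final embedding h for every position

amazon_h = h_final[0]                            # the row for "Amazon" (position 0 in this toy sentence)
```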
Caveat: All positions, weights, and probabilities are illustrative — not
computed from any model. The mechanism (Q·K^T, softmax, weighted sum, multi-head
concat, W_O projection, linear task heads) reflects how transformers actually work.
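Under the same illustrative assumptions, the task-head step is just two independent linear projections of that h followed by a softmax. The dimensions, weights, and class labels below are placeholders, not outputs of any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)

d_model, vocab_size, n_classes = 12, 10, 2       # placeholder sizes matching the sketch above

h = rng.normal(size=(d_model,))                  # stand-in for the final embedding h from attention

W_lm = rng.normal(size=(d_model, vocab_size))    # next-token (language-modeling) head
W_cls = rng.normal(size=(d_model, n_classes))    # classification head

p_next_token = softmax(h @ W_lm)                 # distribution over a tiny placeholder vocabulary
p_class = softmax(h @ W_cls)                     # distribution over classes, e.g. ORG vs. LOCATION
```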