About This Visualization
This interactive diagram illustrates how multi-head attention transforms a polysemous word
("Amazon") based on surrounding context. The visualization shows:
- Stage 0: Without context, "Amazon" sits equidistant from the geographic and corporate regions of embedding space.
- Stage 1–2: Three attention heads (syntactic, semantic, entity) attend differently to the sentence. Their outputs are concatenated and projected via W_O into the final embedding h (see the sketch after this list).
- Task heads: The same h can be projected through different learned matrices (W_lm for next-token prediction, W_cls for classification) to produce different output distributions (see the task-head sketch below the caveat).
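To make the Stage 1–2 mechanics concrete, here is a minimal NumPy sketch of the same pipeline: each head computes softmax(Q·K^T / √d)·V, the head outputs are concatenated, and W_O projects the result into the final embedding h. All sizes, the random weights, and the three-head split are illustrative assumptions for this sketch, not values taken from the visualization or any trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

seq_len, d_model, n_heads = 6, 12, 3        # e.g. the tokens of a short "Amazon ..." sentence; sizes are placeholders
d_head = d_model // n_heads

x = rng.normal(size=(seq_len, d_model))     # input embeddings, one row per token

head_outputs = []
for _ in range(n_heads):                    # stand-ins for the "syntactic", "semantic", "entity" heads
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_head))

    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(d_head)      # Q·K^T, scaled
    attn = softmax(scores, axis=-1)         # each row: how much this token attends to every token
    head_outputs.append(attn @ V)           # weighted sum of value vectors

concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, d_model): head outputs side by side
W_O = rng.normal(size=(d_model, d_model))
h_final = concat @ W_O                           # final embedding h for every position

amazon_h = h_final[0]                            # the row for "Amazon" (position 0 in this toy sentence)
```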
Caveat: All positions, weights, and probabilities are illustrative — not
computed from any model. The mechanism (Q·K^T, softmax, weighted sum, multi-head
concat, W_O projection, linear task heads) reflects how transformers actually work.
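Under the same illustrative assumptions, the task-head step is just two independent linear projections of that h followed by a softmax. The dimensions, weights, and class labels below are placeholders, not outputs of any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)

d_model, vocab_size, n_classes = 12, 10, 2       # placeholder sizes matching the sketch above

h = rng.normal(size=(d_model,))                  # stand-in for the final embedding h from attention

W_lm = rng.normal(size=(d_model, vocab_size))    # next-token (language-modeling) head
W_cls = rng.normal(size=(d_model, n_classes))    # classification head

p_next_token = softmax(h @ W_lm)                 # distribution over a tiny placeholder vocabulary
p_class = softmax(h @ W_cls)                     # distribution over classes, e.g. ORG vs. LOCATION
```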