The input pipeline — every LLM starts here
The model never sees raw text. Text first passes through a fixed tokenizer (trained separately on a corpus) that converts strings into sequences of integer IDs. Each integer indexes a row of the embedding matrix, producing a dense vector. Only then does the transformer begin.
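The lookup step can be sketched in a few lines. This is a toy illustration, not any real model's code: the vocabulary size, embedding width, and token IDs below are made up.

```python
import numpy as np

# Toy sizes, purely illustrative (real models use vocab ~32k-256k, d_model ~1k+)
vocab_size, d_model = 8, 4
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))  # one learned row per token ID

token_ids = [3, 1, 5]           # what a tokenizer would emit for some string
vectors = embedding[token_ids]  # plain row indexing: shape (3, d_model)
print(vectors.shape)
```

The "embedding layer" is nothing more than this row lookup; everything the transformer does afterwards operates on these dense vectors, never on characters.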
BPE Algorithm — Training the Tokenizer
Byte Pair Encoding (Sennrich et al. 2016) is how GPT-2, LLaMA, Mistral, DeepSeek, and
most modern LLMs build their vocabulary. Start with 256 byte tokens, iteratively merge
the most frequent adjacent pair.
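The training loop above can be sketched directly. This is a simplified version for illustration: real byte-level BPE implementations also apply pre-tokenization (splitting on whitespace/punctuation) and deterministic tie-breaking, which are omitted here.

```python
from collections import Counter

def train_bpe(corpus: bytes, num_merges: int):
    """Simplified BPE training: start from raw bytes, repeatedly merge
    the most frequent adjacent pair into a new token ID."""
    seq = list(corpus)   # token IDs; 0-255 are the 256 base byte tokens
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges[best] = next_id
        # Replace every occurrence of the pair with the new token
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    return merges, seq

merges, seq = train_bpe(b"low lower lowest", num_merges=3)
```

On this tiny corpus the first merges pick up the shared "lo"/"low" prefix, which is exactly how real vocabularies come to contain frequent substrings as single tokens.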
Why Tokenization Matters for Reasoning
Tokenization artifacts cause reasoning failures that look like cognitive deficits
but are representation problems baked in before the first attention layer fires.
Letter counting: "How many r's in strawberry?" fails
because "strawberry" tokenizes as
["str", "aw", "berry"] — the model never sees individual
letters.
Arithmetic: multi-digit tokens force the model to learn arithmetic over opaque chunks. "1234" as one token versus "1", "2", "3", "4" as four tokens require entirely different learned circuits.
Reversal curse: the model only sees left-to-right
byte sequences. "A is B" doesn't teach "B is A" because the token sequences are
completely different.
Multilingual: tokenizers trained mostly on English need 2-3× more tokens to encode the same content in Chinese or Arabic, leaving less context window for actual reasoning.
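The multilingual penalty is visible even before any merges are learned: byte-level BPE starts from UTF-8, where non-Latin scripts already cost more bytes per character. The sketch below uses raw byte counts as a rough proxy (actual token counts depend on the tokenizer's training mix, and the example strings are arbitrary).

```python
# Illustrative strings; UTF-8 byte count is a crude lower-bound proxy
# for how much "raw material" a byte-level tokenizer must compress.
texts = {
    "English": "Hello, how are you today?",
    "Chinese": "你好，你今天怎么样？",
}
for lang, s in texts.items():
    b = len(s.encode("utf-8"))
    print(f"{lang}: {len(s)} chars -> {b} UTF-8 bytes ({b / len(s):.1f} bytes/char)")
```

ASCII text sits at 1 byte per character while CJK characters take 3, so unless the merge table was trained on plenty of Chinese, the same sentence starts its journey through the tokenizer several times larger.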
Counter-argument: reasoning models (o1, R1) partially overcome tokenization limits
via chain-of-thought decomposition — but this is a workaround, not a fix.
A side-by-side comparison of tokenizers used by major LLMs: vocabulary size, algorithm, compression ratio (bytes per token), pre-tokenization scheme, and notable features.
| Model | Vocab Size | Algorithm | Bytes/Token | Pre-Tokenization | Notes |
|-------|------------|-----------|-------------|------------------|-------|