The input pipeline — every LLM starts here
The model never sees raw text. Text first passes through a fixed tokenizer (trained separately on a corpus) that converts strings into sequences of integer IDs. Each integer indexes a row of the embedding matrix, producing a dense vector. Only then does the transformer begin.
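The lookup step can be sketched in a few lines. This is a toy illustration, not any real model's code: the vocabulary size, embedding width, and token IDs below are made up.

```python
import numpy as np

# Toy sizes, purely illustrative (real models use vocab ~32k-256k, d_model ~1k+)
vocab_size, d_model = 8, 4
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))  # one learned row per token ID

token_ids = [3, 1, 5]           # what a tokenizer would emit for some string
vectors = embedding[token_ids]  # plain row indexing: shape (3, d_model)
print(vectors.shape)
```

The "embedding layer" is nothing more than this row lookup; everything the transformer does afterwards operates on these dense vectors, never on characters.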
BPE Algorithm — Training the Tokenizer
Byte Pair Encoding (Sennrich et al. 2016) is how GPT-2, LLaMA, Mistral, DeepSeek, and
most modern LLMs build their vocabulary. Start with 256 byte tokens, iteratively merge
the most frequent adjacent pair.
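The training loop above can be sketched directly. This is a simplified version for illustration: real byte-level BPE implementations also apply pre-tokenization (splitting on whitespace/punctuation) and deterministic tie-breaking, which are omitted here.

```python
from collections import Counter

def train_bpe(corpus: bytes, num_merges: int):
    """Simplified BPE training: start from raw bytes, repeatedly merge
    the most frequent adjacent pair into a new token ID."""
    seq = list(corpus)   # token IDs; 0-255 are the 256 base byte tokens
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges[best] = next_id
        # Replace every occurrence of the pair with the new token
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    return merges, seq

merges, seq = train_bpe(b"low lower lowest", num_merges=3)
```

On this tiny corpus the first merges pick up the shared "lo"/"low" prefix, which is exactly how real vocabularies come to contain frequent substrings as single tokens.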
Why Tokenization Matters for Reasoning
Tokenization artifacts cause reasoning failures that look like cognitive deficits
but are representation problems baked in before the first attention layer fires.
Letter counting: "How many r's in strawberry?" fails
because "strawberry" tokenizes as
["str", "aw", "berry"] — the model never sees individual
letters.
Arithmetic: multi-digit tokens force the model to learn arithmetic over opaque chunks. "1234" as one token versus "1", "2", "3", "4" as four tokens require entirely different learned circuits.
Reversal curse: the model only sees left-to-right
byte sequences. "A is B" doesn't teach "B is A" because the token sequences are
completely different.
Multilingual: tokenizers trained mostly on English need 2-3× more tokens to encode the same content in Chinese or Arabic, leaving less context window for actual reasoning.
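The multilingual penalty is visible even before any merges are learned: byte-level BPE starts from UTF-8, where non-Latin scripts already cost more bytes per character. The sketch below uses raw byte counts as a rough proxy (actual token counts depend on the tokenizer's training mix, and the example strings are arbitrary).

```python
# Illustrative strings; UTF-8 byte count is a crude lower-bound proxy
# for how much "raw material" a byte-level tokenizer must compress.
texts = {
    "English": "Hello, how are you today?",
    "Chinese": "你好，你今天怎么样？",
}
for lang, s in texts.items():
    b = len(s.encode("utf-8"))
    print(f"{lang}: {len(s)} chars -> {b} UTF-8 bytes ({b / len(s):.1f} bytes/char)")
```

ASCII text sits at 1 byte per character while CJK characters take 3, so unless the merge table was trained on plenty of Chinese, the same sentence starts its journey through the tokenizer several times larger.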
Counter-argument: reasoning models (o1, R1) partially overcome tokenization limits
via chain-of-thought decomposition — but this is a workaround, not a fix.
A side-by-side comparison of tokenizers used by major LLMs: vocabulary size, algorithm, compression ratio (bytes per token), pre-tokenization scheme, and notable features.
| Model | Vocab Size | Algorithm | Bytes/Token | Pre-Tokenization | Notes |
|-------|------------|-----------|-------------|------------------|-------|