Language & Text

Tokenization

The process of splitting text into smaller units (tokens) that a language model can process.

Tokenization is the preprocessing step that converts raw text into a sequence of tokens that a language model can process. Tokens are not exactly words — they're subword units, typically 3–6 characters long. "Tokenization" might become ["Token", "ization"]. Common words like "the" or "is" are usually single tokens; rare words get split into multiple pieces.
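The splitting described above can be sketched with a greedy longest-match over a toy vocabulary. This is illustrative only — the vocabulary below is made up, and real tokenizers are trained on large corpora rather than hand-written:

```python
# Toy greedy longest-match subword tokenizer (illustrative only;
# real tokenizers use trained vocabularies of 30k-100k entries).
TOY_VOCAB = {"token", "ization", "the", "is"} | set("abcdefghijklmnopqrstuvwxyz")

def tokenize(word, vocab=TOY_VOCAB):
    """Split a word into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible match first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: emit as-is (real tokenizers fall back to bytes).
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("tokenization"))  # ['token', 'ization']
print(tokenize("the"))           # ['the'] — common word, single token
```

Note how the common word "the" stays whole while the rarer "tokenization" splits into two pieces, matching the behavior described above.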

Modern LLMs use Byte-Pair Encoding (BPE) or similar algorithms to build a vocabulary of 30,000–100,000 tokens from training data. Frequent character sequences get merged into single tokens. This allows the model to handle any text — including new words, code, and non-English languages — without an unknown-word problem.
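A minimal sketch of the BPE training loop on a tiny made-up corpus — each round merges the most frequent adjacent symbol pair into a new vocabulary entry (real implementations work at the byte level and run hundreds of thousands of merges):

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent pair."""
    # Start with each word represented as a tuple of characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

corpus = ["low", "low", "lower", "lowest"]
print(bpe_merges(corpus, 2))
```

Frequent sequences like "lo" get merged first, which is exactly why common words end up as single tokens while rare words remain split into pieces.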

Practical impact: Token count determines cost and context window usage. English text averages ~1 token per 4 characters. Code, Chinese, and emoji are generally less efficient — more tokens per character. Always check token counts before sending large documents to APIs.
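The ~4-characters-per-token rule of thumb gives a quick pre-flight estimate; for exact counts you'd use the target model's own tokenizer (e.g. tiktoken for OpenAI models). A minimal sketch:

```python
def estimate_tokens(text):
    """Rough token estimate for English text using the ~4 chars/token
    rule of thumb. Code, Chinese, and emoji typically need more tokens
    per character, so treat this as a lower bound for such inputs."""
    return max(1, len(text) // 4)

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # ~11
```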

Why Tokenization Matters

  • Controls how efficiently text fits in the context window
  • Affects API costs (most providers charge per token)
  • Impacts how well the model handles different languages and code
  • Token boundaries can affect model performance on specific tasks

The tokenizer is model-specific: GPT-4 uses tiktoken, while Llama uses SentencePiece. The same text can therefore tokenize differently across models, so token counts from one tokenizer don't transfer to another. Tokenization also explains why models struggle with character-level tasks (e.g., "how many 'r's in strawberry?"): the model sees token IDs, not individual characters, so it cannot count letters inside a token. This is a known limitation.
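A tiny illustration of why character counting fails (the vocabulary and IDs below are hypothetical, not from any real model):

```python
# Hypothetical vocabulary mapping subword pieces to integer IDs.
vocab = {"straw": 101, "berry": 102}

text = "strawberry"
ids = [vocab["straw"], vocab["berry"]]

# The model's input is just the ID sequence [101, 102]. The characters
# inside each piece are not visible, so "count the r's" requires
# knowledge the model never directly observes.
print(ids)  # [101, 102]
```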
