Evaluation & Metrics
Perplexity
A metric that measures how well a language model predicts text — lower perplexity means better predictions.
Perplexity is the standard intrinsic metric for evaluating language models. It measures how "surprised" the model is by the test text — lower perplexity means the model assigns higher probability to the correct tokens, indicating a better predictive fit to the data.
Mathematically, perplexity is the exponentiation of the average per-token cross-entropy loss. A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 equally likely options at each token.
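The relationship between cross-entropy and perplexity can be sketched with a short computation — a minimal illustration, not tied to any particular model or library:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average per-token negative log-likelihood.

    token_log_probs: natural-log probabilities the model assigned
    to each correct token in the test text.
    """
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns every token probability 1/20 is exactly as
# uncertain as a uniform choice among 20 options: perplexity 20.
uniform_logps = [math.log(1 / 20)] * 5
print(perplexity(uniform_logps))
```

In practice, libraries report the cross-entropy loss directly, so perplexity is usually obtained by exponentiating the evaluation loss.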
Comparison: modern LLMs achieve perplexity below 10 on high-quality text, down from 100+ for pre-transformer models.
Perplexity is useful for comparing models on the same test set with the same tokenizer, but it is a poor proxy for downstream task performance. Modern evaluations therefore supplement perplexity with task-specific benchmarks.