Evaluation & Metrics
Perplexity
A metric that measures how well a language model predicts text — lower perplexity means better predictions.
Perplexity is the standard intrinsic metric for evaluating language models. It measures how "surprised" the model is by the test text — lower perplexity means the model assigns higher probability to the correct tokens, indicating a better predictive fit to the data.
Mathematically, perplexity is the exponentiation of the average per-token cross-entropy loss. A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 equally likely options at each token.
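The relationship between cross-entropy and perplexity can be sketched with a short computation — a minimal illustration, not tied to any particular model or library:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average per-token negative log-likelihood.

    token_log_probs: natural-log probabilities the model assigned
    to each correct token in the test text.
    """
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns every token probability 1/20 is exactly as
# uncertain as a uniform choice among 20 options: perplexity 20.
uniform_logps = [math.log(1 / 20)] * 5
print(perplexity(uniform_logps))
```

In practice, libraries report the cross-entropy loss directly, so perplexity is usually obtained by exponentiating the evaluation loss.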
Comparison: modern LLMs achieve perplexity below 10 on high-quality text, down from 100+ for pre-transformer models.
Perplexity is useful for comparing models on the same test set with the same tokenizer, but it is a poor proxy for downstream task performance. Modern evaluations therefore supplement perplexity with task-specific benchmarks.