Layer Normalization
A normalization technique that stabilizes training by normalizing activations across features within each sample.
Layer normalization normalizes activations across the feature dimension of each individual sample, unlike BatchNorm, which normalizes each feature across the batch dimension. This makes it independent of batch size and a natural fit for sequence models, where batch statistics are awkward with variable-length inputs and small batches.
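The per-sample computation can be sketched in a few lines of NumPy (function name, shapes, and the epsilon value here are illustrative, not a reference implementation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed over the last (feature) axis of each
    # sample independently -- no information crosses the batch axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # gamma and beta are learned per-feature scale and shift parameters.
    return gamma * x_hat + beta

x = np.random.randn(2, 4, 8)  # (batch, sequence, features)
out = layer_norm(x, np.ones(8), np.zeros(8))
```

With gamma = 1 and beta = 0, every feature vector in `out` has mean approximately 0 and variance approximately 1, regardless of what the rest of the batch contains.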
LayerNorm is the standard normalization layer in transformers. It's applied before or after each sublayer (attention and feedforward) to stabilize training and enable very deep networks.
Pre-LN vs Post-LN: applying LayerNorm before each sublayer (Pre-LN) generally trains more stably than applying it after the residual addition (Post-LN), which is why Pre-LN is standard in modern transformers.
Variants like RMSNorm simplify LayerNorm by skipping the mean subtraction (and typically the bias term), saving compute with little or no loss in quality. Many modern LLMs, including LLaMA, use RMSNorm for efficiency.
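The simplification is visible in code: RMSNorm divides by the root mean square of the features instead of centering and dividing by the standard deviation. A minimal sketch (names and epsilon are illustrative):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # RMSNorm: scale by the root mean square of the feature vector.
    # No mean subtraction and no bias term, unlike full LayerNorm.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

x = np.random.randn(4, 8)
y = rms_norm(x, np.ones(8))
```

Each output feature vector has a root mean square of approximately 1; its mean is whatever the input's was, since nothing is subtracted.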