FlashAttention
A memory-efficient attention algorithm that speeds up transformer training and inference by avoiding materialization of the full attention matrix.
FlashAttention, introduced by Tri Dao and collaborators in 2022, is an IO-aware implementation of attention that processes the computation in tiles, using an online softmax to incrementally rescale partial results, so the full N×N attention matrix is never materialized in memory. This dramatically reduces memory usage and improves speed.
It is mathematically identical to standard attention but uses the memory hierarchy of modern GPUs more efficiently, keeping intermediate tiles in fast on-chip SRAM instead of repeatedly reading and writing slow high-bandwidth memory (HBM).
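The tiling idea above can be sketched in a few lines of NumPy: process K and V one block at a time, keeping only a running row-maximum and running softmax denominator per query, and rescale the partial output whenever the maximum changes. This is an illustrative sketch of the online-softmax recurrence, not the real fused GPU kernel; the function names and block size are my own.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    # FlashAttention-style pass: visit K/V in blocks, maintaining per-row
    # running max m and running denominator l (online softmax), so only
    # an N x block tile of scores exists at any time.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(V)
    m = np.full(N, -np.inf)   # running row max of scores seen so far
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                 # N x block score tile
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)              # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))  # prints True
```

The equality check holds because the running rescaling by `alpha` makes the blockwise computation exactly equivalent to the global softmax; on a GPU, the per-tile quantities live in SRAM and only the final output is written back to HBM.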
FlashAttention is now standard in virtually all transformer training and inference. Later versions (FlashAttention-2, FlashAttention-3) pushed speed and efficiency further. It's one of the most impactful systems-level optimizations in modern AI.