Title: Optimizing attention for modern hardware
Abstract: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention pioneered an approach to speed up attention on GPUs by minimizing memory reads/writes. We describe recent advances in this area, including optimizations for Hopper GPUs that exploit the asynchrony of the Tensor Cores and TMA to (1) overlap computation and data movement via warp specialization, (2) interleave block-wise matmul and softmax operations, and (3) use block quantization and incoherent processing that leverage hardware support for FP8 low precision. This allows us to reach up to 85% of the theoretical maximum TFLOPS. We then cover advanced techniques and optimizations for inference, such as persistent kernels, load balancing, and GQA packing. Finally, we examine new attention variants designed specifically for inference efficiency and test-time compute.
Bio: Tri Dao is an Assistant Professor at Princeton University and Chief Scientist of Together AI. He completed his PhD in Computer Science at Stanford. He works at the intersection of machine learning and systems, and his research interests include hardware-efficient algorithms and sequence models with long-range memory. His work has received the COLM 2024 Outstanding Paper Award and the ICML 2022 Outstanding Paper Runner-Up Award.