3:00 PM EST
Hongyang Zhang
Waterloo & Vector Institute

Title: EAGLE and EAGLE-2: Lossless Inference Acceleration for LLMs

Abstract: This talk introduces the lossless large language model acceleration algorithm EAGLE and its follow-up, EAGLE-2 (“EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty” and “EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees”). EAGLE performs autoregression at the more structured feature level rather than at the token level, while incorporating sampling results to eliminate uncertainty. Thanks to these two innovations, EAGLE’s draft model is both lightweight and accurate, improving the inference speed of large language models by 2.1x–3.8x while provably leaving the output distribution unchanged. EAGLE-2 introduces dynamic draft trees: it uses the draft model’s confidence to approximate the acceptance rate of draft tokens and dynamically adjusts the structure of the draft tree to increase the average acceptance length. Building on EAGLE-1, EAGLE-2 achieves an additional 20%–40% speedup, for a total acceleration of 2.5x–5.0x, again while provably maintaining the original output distribution. EAGLE and EAGLE-2 have also been adopted by industry and open-source frameworks, with integrations in vLLM, SGLang, the Intel LLM Library for PyTorch, Intel Extension for Transformers, and more.
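The dynamic-draft-tree idea in EAGLE-2 can be sketched as a best-first search: expand the draft node whose cumulative draft-model confidence (the product of token probabilities along its path, a proxy for acceptance probability) is highest, until a token budget is reached. The sketch below is an illustration of that idea only, not EAGLE-2's actual implementation; `draft_probs` is a hypothetical interface standing in for the draft model.

```python
import heapq

def expand_draft_tree(draft_probs, budget=5, branch=2):
    """Greedily grow a draft tree, best-first by cumulative confidence.

    draft_probs(prefix) -> list of (token, prob): the draft model's top
    candidates after `prefix` (a hypothetical interface, not EAGLE's API).
    Returns up to `budget` draft nodes as (token_path, confidence),
    sorted by decreasing confidence.
    """
    heap = [(-1.0, ())]  # max-heap via negation; root = empty path, conf 1.0
    nodes = []           # candidate draft nodes: (path, confidence)
    while heap and len(nodes) < budget:
        neg_conf, path = heapq.heappop(heap)
        conf = -neg_conf
        # Expand the highest-confidence frontier node first.
        for tok, p in draft_probs(path)[:branch]:
            child = path + (tok,)
            nodes.append((child, conf * p))
            heapq.heappush(heap, (-(conf * p), child))
    return sorted(nodes, key=lambda n: -n[1])[:budget]

def toy_draft(prefix):
    # Hypothetical toy draft model: same two candidates at every step.
    return [("a", 0.6), ("b", 0.3)]

tree = expand_draft_tree(toy_draft, budget=5)
# Deep continuations of confident tokens (e.g. "a", "a a") outrank
# shallow low-confidence ones, which is what lengthens acceptance.
```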

Bio: Hongyang Zhang is a tenure-track assistant professor at the University of Waterloo and the Vector Institute for AI. He received his PhD in 2019 from the Machine Learning Department at Carnegie Mellon University and completed a postdoc at the Toyota Technological Institute at Chicago. He is the winner of the NeurIPS 2018 Adversarial Vision Challenge, the CVPR 2021 Security AI Challenger, an AAAI New Faculty Highlight, an Amazon Research Award, and the WAIC Yunfan Award. He also regularly serves as an area chair for NeurIPS, ICLR, ICML, AISTATS, AAAI, and ALT, and as an action editor for DMLR.