12:00 PM EDT
Nadav Timor
Weizmann Institute

Title: Advances in speculative decoding (heterogeneous vocabularies, speculation parallelism, dynamic lookahead)

Abstract: Speculative decoding is often effective at reducing latency and increasing throughput. However, it fails for slow or inaccurate drafters. This talk covers recent advances that mitigate this limitation and enhance efficiency.

1/ We present novel speculative decoding algorithms for heterogeneous vocabularies that enable the use of drafters from outside the target model's family, with theoretical guarantees on increased acceptance rates. These algorithms broaden the applicability of speculative decoding to models whose smallest family member is still too slow as a drafter, or that have no family at all. After demonstrating up to 2.1x speedups with DeepSeek, Phi, Gemma, and Mixtral, they have become the default in Hugging Face Transformers (145k+ GitHub stars; ICML '25 oral, top 1%).

2/ We introduce speculation parallelism, a new form of parallelism that runs multiple model instances concurrently, and explain how distributed speculative decoding uses it to effectively hide verification latency, yielding significant speedups with theoretical guarantees (ICLR '25).

3/ We present new heuristics for dynamic speculation lookahead scheduling that achieve real-world speedups and have therefore become the default in Hugging Face Transformers (NeurIPS ENLSP workshop '24, PMLR).
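For context, the classic accept/reject rule that underlies speculative decoding (the baseline the talk's contributions build on, not the heterogeneous-vocabulary algorithms themselves) can be sketched as follows. This is a minimal illustration with toy token distributions standing in for real model outputs; the function name and arguments are hypothetical.

```python
import random

def speculative_step(p_target, q_draft, draft_token, rng=random.random):
    """One accept/reject step of standard speculative decoding.

    p_target, q_draft: dicts mapping token -> probability under the
    target model and the drafter (toy stand-ins for model outputs).
    The drafted token is accepted with probability min(1, p/q); on
    rejection, a token is resampled from the residual distribution
    max(0, p - q), which preserves the target model's distribution.
    """
    p = p_target.get(draft_token, 0.0)
    q = q_draft.get(draft_token, 0.0)
    if q > 0 and rng() < min(1.0, p / q):
        return draft_token, True  # drafted token accepted as-is
    # Rejection: resample from the (unnormalized) residual p - q.
    residual = {t: max(0.0, p_target.get(t, 0.0) - q_draft.get(t, 0.0))
                for t in p_target}
    tokens, weights = zip(*residual.items())
    return random.choices(tokens, weights=weights)[0], False
```

A drafter whose vocabulary differs from the target's breaks the direct p/q comparison above, which is the mismatch the heterogeneous-vocabulary algorithms address.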

Bio: Nadav Timor is a PhD student at the Weizmann Institute of Science, visiting Yann LeCun’s lab at NYU.