3:00 PM EDT
Tal Schuster
Google DeepMind

Title: Adaptive Compute LLMs with Early Exits

Abstract: Scaling LLMs is a proven recipe for enhancing their reasoning capabilities. Unfortunately, larger models generate more slowly and are costly to serve, especially in autoregressive text generation, where every next-token prediction requires a full pass through the Transformer. However, not all token predictions are equally difficult. In practice, we find that LLMs can express their advanced reasoning capabilities while selectively using their full capacity only for the few most challenging token predictions. Building on this intuition, we introduce early exits into LLMs and study effective ways to obtain inference speedups while controlling the quality of the generated output. We discuss the practical challenges of building and deploying early-exit models and how to overcome them. In the second part of the talk, we present a new recursive transformer architecture that combines early exits with parameter sharing and tailored low-rank adapters to maximize LLM inference speed, even in large-batch settings.
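
To make the early-exit and recursive-sharing ideas concrete, here is a minimal toy sketch (not the speaker's implementation; all weights, dimensions, names, and the confidence threshold are illustrative): a single shared block is applied recursively, each depth adds its own low-rank adapter, and decoding for a token stops as soon as an intermediate prediction is confident enough.

```python
# Toy sketch of confidence-based early exit over a recursive (weight-shared)
# stack with per-depth low-rank adapters. Purely illustrative: random weights,
# made-up shapes, and an arbitrary 0.9 confidence threshold.
import numpy as np

rng = np.random.default_rng(0)
D, V, DEPTH, RANK = 64, 100, 8, 4

W_shared = rng.normal(scale=0.02, size=(D, D))      # one block, reused at every depth
lora = [(rng.normal(scale=0.02, size=(D, RANK)),    # per-depth low-rank adapters (A, B)
         rng.normal(scale=0.02, size=(RANK, D))) for _ in range(DEPTH)]
W_out = rng.normal(scale=0.02, size=(D, V))         # output head, usable as an exit head

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def generate_token(h, threshold=0.9):
    """Run the recursive stack, exiting early once the exit head is confident."""
    for depth, (A, B) in enumerate(lora, start=1):
        h = np.tanh(h @ (W_shared + A @ B))         # shared weights + depth-specific adapter
        p = softmax(h @ W_out)                      # intermediate (early-exit) prediction
        if p.max() >= threshold:                    # easy token: stop computing here
            return int(p.argmax()), depth
    return int(p.argmax()), DEPTH                   # hard token: used the full stack

token, layers_used = generate_token(rng.normal(size=D))
print(f"predicted token {token} after {layers_used}/{DEPTH} layers")
```

In a trained and calibrated model, easy tokens would exit after a few applications of the shared block while hard tokens use the full depth, which is where the average-case speedup comes from.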

Bio: Tal Schuster is a Researcher in the GenAI unit at Google DeepMind, where he leads an R&D group working on adaptive-compute solutions for LLMs, with a focus on efficiency, reasoning, and test-time scaling. Tal has received multiple Perfy Awards and the Google Tech Impact Award, and he completed his PhD at MIT.