3:30 PM EST
Ion Stoica
Berkeley, Anyscale & Databricks

Title: Accelerating LLM Inference with vLLM (and SGLang)

Abstract: Inference efficiency remains a critical challenge for deploying large language models (LLMs) at scale. In this talk, I will present the work on LLM inference that we have conducted at Berkeley over the past two years in the context of vLLM and SGLang, which are today the most popular open-source inference engines. In particular, I will describe two of the key techniques they introduced, PagedAttention and RadixAttention, which have since been adopted by the majority of LLM inference engines. Finally, I will discuss the new architecture of vLLM.
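
For context, below is a minimal sketch of the core idea behind PagedAttention: the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks, analogous to virtual-memory paging. This is an illustrative toy, not vLLM's actual API; names such as `BlockAllocator`, `Sequence`, and `BLOCK_SIZE = 16` are hypothetical choices for this sketch.

```python
# Toy sketch of PagedAttention-style KV-cache paging (not vLLM's real API).
# Memory is allocated block-by-block on demand, so sequences of different
# lengths share one pool without pre-reserving contiguous max-length slabs.
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)


@dataclass
class BlockAllocator:
    """Hands out physical block ids from a fixed pool, like a page allocator."""
    num_blocks: int
    free: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.free = list(range(self.num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    """A generating sequence with its logical-to-physical block table."""
    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.alloc())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=64)
seq = Sequence()
for _ in range(40):  # 40 generated tokens -> ceil(40/16) = 3 blocks
    seq.append_token(allocator)
print(seq.block_table)  # three physical block ids, not one contiguous slab
```

RadixAttention applies a related idea at the prefix level: sequences that share a prompt prefix share the corresponding KV-cache blocks, tracked in a radix tree, so common prefixes are computed and stored once.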