Title: Next-Gen Long-Context LLM Inference with LMCache
Abstract: Long-context LLMs unlock powerful applications, but high prefill cost and latency remain major bottlenecks. This talk presents LMCache, a research-driven KV cache backend integrated into the vLLM Production Stack that enables fast, scalable, and cost-efficient inference. We'll showcase key techniques, including KV cache compression, blending, and disaggregation, and highlight real-world gains: up to 10× faster responses, higher throughput, and lower cost. LMCache bridges cutting-edge research with practical deployment, making long-context LLMs truly usable at scale.
Bio: Junchen Jiang is an Associate Professor of Computer Science at the University of Chicago. His research interests are networked systems and their intersections with machine learning. He received his bachelor’s degree from Tsinghua University (Yao Class) in 2011 and his Ph.D. from CMU in 2017. He has received two Google Faculty Research Awards, an NSF CAREER Award, a Best Paper Award at EuroSys, and a CMU Computer Science Doctoral Dissertation Award.