Title: Towards Fast and Affordable Serving Systems for Large Language Models
Abstract: In the rapidly evolving field of generative artificial intelligence, efficiently deploying large language models (LLMs) is a critical challenge. In this talk, I will introduce three approaches we have developed to improve the efficiency and cost-effectiveness of LLM inference and serving systems. First, I will present SpecInfer, the first tree-based speculative inference system, which reduces LLM serving latency by 1.5-3.5x compared to existing systems through a novel token-tree speculation and verification mechanism. Next, I will describe SpotServe, the first LLM serving system for spot instances, which handles preemptions through dynamic reparallelization, maintains relatively low tail latency, and reduces monetary cost by 54%. Finally, I will present Mirage, a superoptimizer that automatically discovers highly optimized GPU implementations for LLMs and beyond, which can even outperform expert-designed implementations such as FlashAttention.
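To give a flavor of the token-tree speculation and verification idea behind SpecInfer, the following is a minimal Python sketch, assuming greedy decoding and toy stand-in models. The function names (`draft_topk`, `target_next`) and the sequential verification loop are illustrative placeholders, not SpecInfer's actual API; in the real system, all branches of the token tree are verified in a single batched forward pass of the target LLM.

```python
# Minimal sketch of token-tree speculation and verification.
# Assumptions: greedy decoding; toy deterministic "models" stand in for
# the draft and target LLMs so the script is self-contained and runnable.

from typing import List, Tuple

Tree = List[Tuple[int, "Tree"]]  # each node: (token, children)


def draft_topk(prefix: List[int], k: int) -> List[int]:
    """Toy draft model: propose k candidate next tokens for a prefix."""
    seed = sum(prefix) % 97
    return [(seed + i) % 50 for i in range(k)]


def target_next(prefix: List[int]) -> int:
    """Toy target model: the single token greedy decoding would emit."""
    return (sum(prefix) * 31 + 7) % 50


def speculate_tree(prefix: List[int], depth: int, k: int) -> Tree:
    """Build a depth-limited token tree of draft-model proposals."""
    if depth == 0:
        return []
    return [
        (tok, speculate_tree(prefix + [tok], depth - 1, k))
        for tok in draft_topk(prefix, k)
    ]


def verify_tree(prefix: List[int], tree: Tree) -> List[int]:
    """Accept the longest tree path the target model agrees with.

    The target's own token is always appended, so even a failed
    speculation still yields one correct token (the standard
    speculative-decoding guarantee). Verification here is sequential
    for clarity; SpecInfer batches it into one forward pass.
    """
    accepted: List[int] = []
    children = tree
    while True:
        correct = target_next(prefix + accepted)
        accepted.append(correct)
        match = next((kids for tok, kids in children if tok == correct), None)
        if match is None:
            break  # speculation diverged; stop after the corrective token
        if not match:
            break  # reached the tree frontier
        children = match
    return accepted


if __name__ == "__main__":
    prefix = [1, 2, 3]
    tree = speculate_tree(prefix, depth=3, k=2)
    print("accepted tokens:", verify_tree(prefix, tree))
```

The speedup comes from the fact that one (batched) verification pass of the large target model can accept several tokens at once whenever a path through the draft tree matches, rather than paying one full forward pass per generated token.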