Title: The Token/s Game and Beyond
Abstract: When interacting with large language models as a user, we often eagerly watch the response stream back, our happiness tied to a single magical number: the number of tokens we receive per second. In this talk, I will share some of our recent thoughts on optimizing LLM inference. I will start from how we think about LLM inference as a system workload, but focus on several recent results aimed at "breaking" the quality/performance tradeoff: can we get something that is both faster AND better? While I do not yet know the answer, I am excited to share some of our recent thoughts and results to facilitate discussion.
The recording for this session is unavailable because it included unpublished results shared by the speaker.