The talk covers the theory necessary to understand how to serve LLMs, presenting the math behind transformer inference in an accessible and light way. By the end of the talk, attendees will be able to reason about the memory and compute an LLM needs at inference time.

The talk will cover:
Did you pay attention? (4 min). A short review of the attention mechanism and how to count parameters in a transformer-based model.
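As an illustration of the level this section aims for, here is a minimal parameter-counting sketch (the GPT-style decoder-only layout, the MLP expansion factor of 4, and the example dimensions are assumptions for illustration, not a specific model):

```python
# Minimal sketch: counting parameters of a GPT-style decoder-only transformer.
# Ignores biases and layer norms; dimensions are illustrative assumptions.

def transformer_params(d_model: int, n_layers: int, vocab_size: int,
                       mlp_ratio: int = 4) -> int:
    """Approximate parameter count for a decoder-only transformer."""
    attn = 4 * d_model * d_model             # W_q, W_k, W_v, W_o projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up- and down-projections
    per_layer = attn + mlp                   # ~12 * d_model^2 with mlp_ratio=4
    embeddings = vocab_size * d_model        # token embeddings (tied LM head)
    return n_layers * per_layer + embeddings

# Example: a 7B-class configuration
n = transformer_params(d_model=4096, n_layers=32, vocab_size=32000)
print(f"~{n / 1e9:.1f}B parameters")
```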
Get to know your params (8 min). The math-y section of the talk, explaining how to translate parameter counts into memory and compute requirements.
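A taste of the arithmetic this section builds on (the 2-bytes-per-parameter and roughly-2-FLOPs-per-parameter-per-token figures are rules of thumb assumed for fp16 weights and dense decoding):

```python
# Sketch: back-of-the-envelope memory and compute from a parameter count.
# The constants are rules of thumb, not exact figures for any specific model.

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB, assuming fp16/bf16 (2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

def flops_per_token(n_params: float) -> float:
    """~2 FLOPs per parameter per generated token (one multiply + one add)."""
    return 2 * n_params

n = 7e9  # a 7B-parameter model
print(f"weights: ~{weight_memory_gb(n):.0f} GB in fp16")
print(f"compute: ~{flops_per_token(n) / 1e9:.0f} GFLOPs per token")
```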
Prefill and Decoding (8 min). Explains that inference happens in two steps (prefill and decoding) and how the KV cache exploits this split to make decoding faster. It also covers common metrics for measuring inference performance, such as time-to-first-token and tokens-per-second.
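An example of the KV-cache arithmetic covered here (the layer count, head count, and head dimension below are illustrative assumptions for a 7B-class model):

```python
# Sketch: KV-cache size per sequence. The cache stores one key and one value
# vector per layer, per token, so it grows linearly with sequence length.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # 2x for keys and values, stored for every layer and every token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 7B-class configuration with a 4096-token context
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"~{size / 1e9:.1f} GB of KV cache per sequence")
```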
Context and batch size (5 min). Adds the sequence length to the picture, as well as the number of requests processed in parallel. Explains how LLM servers, like vLLM, use techniques like PagedAttention to optimise GPU usage.
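A sketch of the memory problem that motivates PagedAttention: with contiguous allocation each request reserves the full context window up front, while block-based allocation only grows with the tokens actually present (the per-token footprint, context length, and 16-token block size are assumptions for illustration; the block size mirrors vLLM's default, but these numbers are not measurements of vLLM):

```python
# Sketch: contiguous vs paged (block-based) KV-cache allocation for a batch
# of requests with very different lengths. Figures are illustrative only.

BYTES_PER_TOKEN = 512 * 1024   # assumed KV-cache footprint of one token
MAX_CONTEXT = 4096             # tokens reserved per request when contiguous
BLOCK_TOKENS = 16              # tokens per block in the paged scheme

def contiguous_gb(batch_lengths: list[int]) -> float:
    # every request reserves the full context window, regardless of actual length
    return len(batch_lengths) * MAX_CONTEXT * BYTES_PER_TOKEN / 1e9

def paged_gb(batch_lengths: list[int]) -> float:
    # each request holds only enough 16-token blocks to cover its tokens so far
    blocks = sum(-(-length // BLOCK_TOKENS) for length in batch_lengths)
    return blocks * BLOCK_TOKENS * BYTES_PER_TOKEN / 1e9

lengths = [130, 512, 90, 2000, 260, 45, 700, 1500]  # a mixed batch of requests
print(f"contiguous: {contiguous_gb(lengths):.1f} GB reserved")
print(f"paged:      {paged_gb(lengths):.1f} GB in use")
```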
Conclusion (5 min). Wrap-up and Q&A.