LLM Inference Arithmetics: the Theory behind Model Serving

Luca Baggi

Wednesday 14:30 in Titanium3

The talk covers the theory necessary to understand how to serve LLMs, walking through the math behind transformer inference in an accessible and light way. By the end of the talk, attendees will learn:

  1. How to count the parameters in an LLM, especially the ones in the attention layers.
  2. The difference between compute and memory in the context of LLM inference.
  3. That LLM inference is made up of two parts: prefill and decoding.
  4. What an LLM server is, and which features it implements to optimise GPU memory usage and reduce latency.
  5. How batching affects your inference metrics, like time-to-first-token.

The talk will cover:

Did you pay attention? (4 min). A short review of the attention mechanism and how to count parameters in a transformer-based model.
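To give a flavour of the kind of counting this section is about, here is a minimal sketch in Python. The configuration numbers are assumptions chosen for illustration (roughly in line with a 7B-parameter, Llama-style model with no biases and a gated MLP); embeddings and normalisation layers are left out, so treat it as a back-of-the-envelope estimate rather than an exact count.

```python
# Back-of-the-envelope parameter count for one transformer block.
# Assumed architecture: multi-head attention without biases, gated MLP,
# optional grouped-query attention (n_kv_heads <= n_heads).

def attention_params(d_model: int, n_heads: int, n_kv_heads: int) -> int:
    head_dim = d_model // n_heads
    q_proj = d_model * (n_heads * head_dim)     # W_Q
    k_proj = d_model * (n_kv_heads * head_dim)  # W_K
    v_proj = d_model * (n_kv_heads * head_dim)  # W_V
    o_proj = (n_heads * head_dim) * d_model     # W_O
    return q_proj + k_proj + v_proj + o_proj

def mlp_params(d_model: int, d_ff: int) -> int:
    # Gated MLP (SwiGLU-style): up, gate and down projections.
    return 3 * d_model * d_ff

# Illustrative 7B-ish configuration (not tied to any specific checkpoint).
d_model, n_heads, n_kv_heads, d_ff, n_layers = 4096, 32, 32, 11008, 32
block = attention_params(d_model, n_heads, n_kv_heads) + mlp_params(d_model, d_ff)
print(f"params per block: {block / 1e6:.1f}M, "
      f"per model (blocks only): {block * n_layers / 1e9:.2f}B")
```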

Get to know your params (8 min). The math-y section of the talk, explaining how to translate parameter counts into memory and compute requirements.
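Two widely used rules of thumb anticipate where this section goes: the weights occupy roughly (number of parameters) × (bytes per parameter) of memory, and generating one token costs roughly 2 × (number of parameters) FLOPs. A minimal sketch, assuming a hypothetical 7B-parameter model:

```python
# Rough memory and compute requirements for a model's weights.
# Rules of thumb: weight memory ~ params * bytes_per_param,
#                 compute ~ 2 * params FLOPs per generated token.

def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

def flops_per_token(n_params: float) -> float:
    return 2 * n_params

n_params = 7e9  # assumed 7B-parameter model
for dtype, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{dtype:>10}: ~{weight_memory_gb(n_params, nbytes):.1f} GB of weights")
print(f"~{flops_per_token(n_params) / 1e9:.0f} GFLOPs per generated token")
```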

Prefill and Decoding (8 min). Explains that inference happens in two steps (prefill and decoding) and how the KV cache exploits this to make decoding faster. Introduces common metrics to measure inference performance, like time-to-first-token and tokens-per-second.
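To make the role of the KV cache concrete, a commonly used estimate of its size per request is 2 (keys and values) × layers × KV heads × head dimension × sequence length × bytes per element; time-to-first-token is then dominated by prefill, while tokens-per-second measures decoding. A sketch with the same assumed, illustrative model numbers as above:

```python
# KV-cache size per request, using the common estimate:
#   2 (K and V) * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed, illustrative configuration with an fp16 cache.
n_layers, n_kv_heads, head_dim = 32, 32, 128
for seq_len in (1_024, 4_096, 32_768):
    gb = kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len) / 1e9
    print(f"seq_len={seq_len:>6}: ~{gb:.2f} GB of KV cache per request")
```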

Context and batch size (5 min). Adds the sequence length to the picture, as well as the number of requests to process in parallel. Explains how LLM servers, like vLLM, use techniques like PagedAttention to optimise GPU memory usage.
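One way to see why this matters: the KV caches of concurrent requests compete for whatever GPU memory the weights leave free, which is exactly the resource that techniques such as PagedAttention manage. A rough sketch under assumed numbers (the 7B model in fp16 from above, on a hypothetical 24 GB GPU):

```python
# How many concurrent requests fit in GPU memory, naively assuming each
# request reserves a full-length KV cache up front (the waste that
# PagedAttention is designed to avoid).

GPU_MEMORY_GB = 24.0           # hypothetical 24 GB card
WEIGHTS_GB = 14.0              # ~7B parameters in fp16
KV_CACHE_PER_REQUEST_GB = 2.1  # ~4k-token cache for the model above

free_gb = GPU_MEMORY_GB - WEIGHTS_GB
max_requests = int(free_gb // KV_CACHE_PER_REQUEST_GB)
print(f"free memory: {free_gb:.1f} GB -> at most {max_requests} concurrent requests")
# PagedAttention instead allocates the cache in small blocks on demand,
# rather than reserving the worst case up front, so real batch sizes can be
# considerably larger.
```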

Conclusion (5 min). Wrap-up and Q&A.

Luca Baggi

AI Engineer at xtream by day and open-source maintainer by night. I strive to be an active part of the Python and PyData communities, for example as an organiser of PyData Milan. Feel free to reach out!