vLLM is an LLM inference and serving engine introduced by the UC Berkeley Sky Computing Lab in June 2023.
At launch, the main contributions were as follows:
- PagedAttention - a technique that stores the KV cache non-contiguously in fixed-size blocks (see the sketch after this list), which allows us to:
  - not over-reserve memory for prompts that are shorter than the maximum sequence length of the model (i.e. all of them)
  - greatly reduce the memory waste caused by fragmentation and over-reservation
- the open-source serving engine itself, which was already powering the Chatbot Arena
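
To make the block-table idea concrete, here is a minimal Python sketch of how a paged KV cache can hand out fixed-size blocks on demand. This is not vLLM's actual implementation; `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are illustrative names chosen for this example, and the block size of 16 tokens is just an assumption.

```python
# Minimal sketch (not vLLM's code) of the block-table idea behind PagedAttention:
# KV-cache memory is split into fixed-size blocks, and each sequence maps its
# logical blocks to physical blocks on demand, so no memory is reserved up front
# for tokens that may never be generated.

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed for this example)


class BlockAllocator:
    """Hands out physical block IDs from a fixed-size pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache is full")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's block table: logical block index -> physical block ID."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grab a new physical block only when the last one is full,
        # so memory grows with the actual sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Return all blocks to the pool so other requests can reuse them.
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=1024)
    seq = Sequence(allocator)
    for _ in range(40):  # a 40-token sequence needs only ceil(40/16) = 3 blocks
        seq.append_token()
    print(seq.block_table)  # e.g. [1023, 1022, 1021] - non-contiguous physical blocks
    seq.release()
```

Because blocks are fixed-size and allocated lazily, the only internal fragmentation is the unfilled tail of each sequence's last block, and freed blocks can immediately be reused by other requests rather than sitting in a per-request reservation.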