vLLM is an LLM inference and serving engine introduced by the UC Berkeley Sky Computing Lab in June 2023.
At launch, the main contributions were as follows:
- PagedAttention - a technique that stores the KV cache non-contiguously in fixed-size blocks (see the sketch after this list), which allows us to:
  - not over-reserve memory for prompts that are shorter than the maximum sequence length of the model (i.e. all of them)
  - greatly reduce the memory waste caused by fragmentation and over-reservation
- the open-source serving engine itself, which was already powering the Chatbot Arena
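
To make the block-table idea concrete, here is a minimal Python sketch of how a paged KV cache can hand out fixed-size blocks on demand. This is not vLLM's actual implementation; `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are illustrative names chosen for this example, and the block size of 16 tokens is just an assumption.

```python
# Minimal sketch (not vLLM's code) of the block-table idea behind PagedAttention:
# KV-cache memory is split into fixed-size blocks, and each sequence maps its
# logical blocks to physical blocks on demand, so no memory is reserved up front
# for tokens that may never be generated.

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed for this example)


class BlockAllocator:
    """Hands out physical block IDs from a fixed-size pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache is full")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's block table: logical block index -> physical block ID."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grab a new physical block only when the last one is full,
        # so memory grows with the actual sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Return all blocks to the pool so other requests can reuse them.
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=1024)
    seq = Sequence(allocator)
    for _ in range(40):  # a 40-token sequence needs only ceil(40/16) = 3 blocks
        seq.append_token()
    print(seq.block_table)  # e.g. [1023, 1022, 1021] - non-contiguous physical blocks
    seq.release()
```

Because blocks are fixed-size and allocated lazily, the only internal fragmentation is the unfilled tail of each sequence's last block, and freed blocks can immediately be reused by other requests rather than sitting in a per-request reservation.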