vLLM is an LLM inference engine introduced by the UC Berkeley Sky Computing Lab in June 2023.

At launch, the main contributions were as follows:

  • PagedAttention - a technique that stores the KV cache non-contiguously in fixed-size blocks (see the sketch after this list), which allows us to:
    • Avoid over-reserving memory for prompts that are shorter than the model's maximum sequence length (i.e. virtually all of them)
    • Greatly reduce the memory waste caused by fragmentation and over-reservation
  • the serving engine powering the Chatbot Arena

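The sketch below is a minimal illustration of the block-table idea behind PagedAttention, not vLLM's actual implementation: each request maps logical KV-cache blocks to physical blocks drawn from a shared pool, and a new physical block is allocated only when the current one fills up, so memory grows with the real sequence length rather than the model's maximum context. The names `BLOCK_SIZE`, `BlockAllocator`, and `Sequence` are hypothetical.

```python
# Minimal sketch of block-based KV-cache bookkeeping in the spirit of
# PagedAttention. Names and structure are illustrative, not vLLM's API.

BLOCK_SIZE = 16  # tokens stored per KV-cache block (assumed value)


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise RuntimeError("out of KV-cache blocks")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's logical-to-physical block table."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grab a new physical block only when the current one is full,
        # so memory is reserved for tokens that actually exist rather
        # than for the model's maximum sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Return all physical blocks to the pool when the request finishes.
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()
        self.num_tokens = 0


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8)
    seq = Sequence(allocator)
    for _ in range(40):          # 40 tokens -> ceil(40 / 16) = 3 blocks
        seq.append_token()
    print(seq.block_table)       # physical blocks need not be contiguous
    seq.release()
```

Because blocks come from a shared free list, the only wasted space per sequence is the unused tail of its last block, instead of the large contiguous reservation a max-length allocation would require.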