vLLM, published by Kwon et al. at UC Berkeley in 2023, is an open-source inference framework that significantly reduced the cost and increased the speed of serving LLMs. Its core innovation is PagedAttention, inspired by virtual-memory paging in operating systems: the KV cache is split into fixed-size blocks that need not be contiguous in GPU memory, reducing fragmentation-related waste to near zero. Combined with continuous batching, a single GPU can serve far more concurrent requests; the vLLM team reports throughput gains of up to 24x over vanilla HuggingFace Transformers. vLLM has become one of the most widely deployed open-source LLM serving stacks, supporting Llama 3, Mixtral, Qwen, and many other models out of the box.
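To make the paging analogy concrete, here is a toy sketch of the block-table idea behind PagedAttention. It is not vLLM's actual implementation: the class and method names, the block size of 16, and the pool size are illustrative assumptions, and real blocks hold the key and value tensors themselves rather than just bookkeeping counts.

```python
# Toy sketch of the PagedAttention block-table idea (not vLLM's actual code).
# KV-cache memory is carved into fixed-size blocks; each sequence keeps a
# "block table" mapping its logical token positions to physical blocks, so it
# can grow on demand without reserving a contiguous region up front.

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        # Pool of free physical block ids, analogous to a page-frame free list.
        self.free_blocks = list(range(num_blocks))
        # seq_id -> list of physical block ids (the per-sequence block table).
        self.block_tables: dict[int, list[int]] = {}
        # seq_id -> number of tokens currently stored.
        self.seq_lens: dict[int, int] = {}

    def add_token(self, seq_id: int) -> None:
        """Account for one newly generated token, allocating a block on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # current block full (or sequence is new)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must be preempted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free_sequence(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=8)
    for _ in range(20):            # sequence 0 generates 20 tokens -> 2 blocks
        cache.add_token(0)
    for _ in range(5):             # sequence 1 generates 5 tokens -> 1 block
        cache.add_token(1)
    print(cache.block_tables)      # non-contiguous blocks, no per-request reservation
    cache.free_sequence(0)         # freed blocks are immediately reusable
```

Because blocks are only allocated as a sequence actually grows and are returned the moment it finishes, almost no KV-cache memory sits idle, which is what lets continuous batching pack many more concurrent requests onto one GPU.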