The KV cache stores the key (K) and value (V) vectors a transformer decoder computes for each token, so they don't have to be recomputed at every generation step. This caching is what makes autoregressive generation efficient: without it, the model would have to re-project the keys and values for the entire context at every step, so per-token cost would grow with context length. But it comes at a cost of its own: KV cache size grows linearly with context length and layer count, easily reaching gigabytes for large models, which makes it the dominant consumer of GPU memory during inference. Efficient KV management, such as PagedAttention and its descendants, is key to serving long-context models economically.
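A minimal sketch of the idea, assuming a single attention head and using identity projections as a stand-in for the model's learned K/V weight matrices (both `project_kv` and the dimension `d` are illustrative, not from any real model):

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 64  # head dimension (assumed value for illustration)

def project_kv(hidden):
    # Placeholder: a real model applies learned W_K / W_V matrices here.
    return hidden, hidden

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

for step in range(5):
    hidden = rng.normal(size=d)  # new token's hidden state
    k, v = project_kv(hidden)
    # Append only the NEW token's K/V; older entries are reused,
    # so each step does O(1) projections instead of O(step).
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attention(hidden, K_cache, V_cache)

print(K_cache.shape)  # (5, 64): one cached K row per generated token
```

To see why the cache reaches gigabytes: with fp16, each token stores 2 (K and V) × layers × hidden-dim × 2 bytes. For a 7B-class model with 32 layers and hidden dimension 4096, that is 2 × 32 × 4096 × 2 = 512 KiB per token, so a 4096-token context alone holds roughly 2 GiB of cache.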
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Advanced · 2020
KV Cache
The cache that stores previously computed key/value vectors so the model doesn't recompute them every step.
- EN — English term
- KV Cache
- TR — Turkish term
- KV Önbelleği