The KV cache stores the key (K) and value (V) vectors a transformer decoder computes for each token, so they don't have to be recomputed at every generation step. This caching is what makes autoregressive generation efficient: without it, the model would have to re-project the keys and values for the entire context at every step, so per-token cost would grow with context length. But it comes at a cost of its own: KV cache size grows linearly with context length and layer count, easily reaching gigabytes for large models, which makes it the dominant consumer of GPU memory during inference. Efficient KV management, such as PagedAttention and its descendants, is key to serving long-context models economically.
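A minimal sketch of the idea, assuming a single attention head and using identity projections as a stand-in for the model's learned K/V weight matrices (both `project_kv` and the dimension `d` are illustrative, not from any real model):

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 64  # head dimension (assumed value for illustration)

def project_kv(hidden):
    # Placeholder: a real model applies learned W_K / W_V matrices here.
    return hidden, hidden

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

for step in range(5):
    hidden = rng.normal(size=d)  # new token's hidden state
    k, v = project_kv(hidden)
    # Append only the NEW token's K/V; older entries are reused,
    # so each step does O(1) projections instead of O(step).
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attention(hidden, K_cache, V_cache)

print(K_cache.shape)  # (5, 64): one cached K row per generated token
```

To see why the cache reaches gigabytes: with fp16, each token stores 2 (K and V) × layers × hidden-dim × 2 bytes. For a 7B-class model with 32 layers and hidden dimension 4096, that is 2 × 32 × 4096 × 2 = 512 KiB per token, so a 4096-token context alone holds roughly 2 GiB of cache.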
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Advanced · 2020
KV Cache
The cache that stores previously computed key/value vectors so the model doesn't recompute them every step.
- EN — English term
- KV Cache
- TR — Turkish term
- KV Önbelleği