Inference Optimization

KV Cache

Making LLM inference fast by caching what you already computed

The Inference Bottleneck

When ChatGPT generates a response, it produces one token at a time — word by word, from left to right. This is called autoregressive decoding, and it is fundamentally sequential. You cannot generate token N+1 until you have finished generating token N. This sequential nature creates a massive bottleneck.

Here is the problem: at every single step, the model needs to compute attention between the new token and ALL previous tokens. If you are generating the 100th token, the model re-processes tokens 1 through 99 just to produce token 100. If you are generating the 1000th token, it re-processes 999 previous tokens. This is O(n²) total computation — it gets slower and slower as the response grows. Without caching, generating a 1000-token response requires roughly 500,000 attention computations. That is the inference bottleneck.

How Transformers Generate Text

To understand why KV Cache matters, you need to see exactly what happens during each decoding step. The Transformer generates text autoregressively: it predicts one token, appends it to the sequence, and feeds the entire sequence back in to predict the next one.

At each step, ALL previous tokens are fed back through attention — even though most were already processed

At step 1, the model processes "The" and predicts "cat". At step 2, it processes "The cat" to predict "sat". At step 3, it processes "The cat sat" to predict "on". Notice that at each step, we are re-computing the Key (K) and Value (V) matrices for tokens we have already seen. Token "The" gets its K and V computed at step 1, again at step 2, again at step 3, and so on. This redundant computation is the performance killer.

The Key Insight

Here is the observation that changes everything: in the attention formula Attention(Q, K, V) = softmax(QKᵀ/√dₖ)·V, the K and V matrices for previous tokens do not change between steps. Once you have computed the Key and Value vectors for token "The" at step 1, those exact same vectors are what you need at step 2, step 3, and step 100. They are deterministic functions of the input — same input, same output.

Why K and V stay the same

K = x·Wₖ and V = x·Wᵥ where x is the input embedding and Wₖ, Wᵥ are fixed learned weight matrices. Since the input tokens never change and the weights are frozen at inference time, K and V for previous positions are identical at every step. Only the new token's Q matters, because it is the one "asking" for attention.

So instead of recomputing K and V from scratch every step, we can store (cache) them after the first computation. At each new step, we only compute K and V for the single new token, then concatenate them with the cached values. This turns O(n²) attention computation into O(n) per step — a massive speedup.

KV Cache in Action

The 3D visualization below shows the contrast directly. On the left, every token is recomputed at every step (red glow = wasted work). On the right, only the new token is computed (green glow), while previous tokens stay dim in the cache. Click to pause/resume the animation.

Left: without cache, all tokens recomputed each step (red). Right: with cache, only the new token computed (green).

The Math: With and Without Cache

Let us make this precise. For a single attention layer with model dimension d and sequence length n:

Without cache: Step t computes K,V for tokens 1..t → O(t·d) per step → O(n²·d) total

With KV Cache, each step only computes K and V for the one new token:

With cache: Step t computes K,V for token t only → O(d) per step → O(n·d) total

The memory cost of the cache itself is modest: storing K and V for all previous tokens costs 2 × n × d × num_layers floating point numbers. For a 70B model with 80 layers, d=8192, and sequence length 4096, the KV cache is about 40 GB in fp16. This is significant but far cheaper than the O(n²) compute it saves.

Memory Layout: O(n²) vs O(n)

The diagram below shows the computational work at each step. On the left (without cache), every step processes more tokens than the last — the work grows quadratically. On the right (with cache), every step processes exactly one new token — the work is constant.

Without cache: 1+2+3+4+5 = 15 operations. With cache: 1+1+1+1+1 = 5 operations. The gap grows with sequence length.

For short sequences (under 10 tokens), the difference is small. But for long-form generation — writing essays, code, or multi-turn conversations — the savings are enormous. At 100 tokens, KV Cache reduces computation by 50x. At 1000 tokens, by 500x. This is why every production LLM serving system uses KV Cache.

Multi-Query Attention (MQA)

Standard Multi-Head Attention (MHA) uses separate K and V projections for each attention head. With 32 heads, that means 32 sets of K and V to cache — a substantial memory footprint. In 2019, Noam Shazeer proposed a radical simplification: Multi-Query Attention (MQA).

In MQA, all heads share a single K and V projection. Each head still has its own Q (so they can "ask" different questions), but they all read from the same K and V (the same "answers"). This shrinks the KV cache by a factor equal to the number of heads — typically 8x to 32x. The trade-off is a small quality degradation, but the inference speedup is enormous, especially for long sequences.

Grouped-Query Attention (GQA)

GQA, introduced in the Llama 2 paper, finds the sweet spot between MHA and MQA. Instead of giving each head its own K and V (MHA) or forcing all heads to share one set (MQA), GQA divides heads into groups. Each group shares one set of K and V projections.

With 8 groups and 32 heads, GQA uses 8 sets of K and V — 4x less than MHA but 8x more than MQA. This gives you most of the memory savings of MQA while preserving nearly all the quality of MHA. Llama 2 70B uses GQA with 8 groups, and Llama 3 continues this pattern.

Think of it as a compromise between individual offices and an open floor plan. MHA gives each head a private office (best focus, most space). MQA puts everyone in one big room (least space, some distraction). GQA creates shared offices for small teams — less space than private offices, but better focus than the open floor plan.

MHA vs MQA vs GQA: Visual Comparison

The diagram below shows how Q, K, and V are organized in each variant. Notice how the KV cache size shrinks from MHA (4 sets) to GQA (2 sets) to MQA (1 set). In real models with 32-128 heads, the difference is even more dramatic.

MHA: every head has its own K,V. MQA: all heads share one K,V. GQA: groups share K,V.

PagedAttention & vLLM

Even with KV Cache, memory management remains a challenge. Traditional systems pre-allocate a contiguous block of memory for each request's maximum possible sequence length. Since most sequences are shorter than the maximum, 60-80% of this memory is wasted — a problem called fragmentation.

PagedAttention, introduced in the vLLM paper (2023), solves this by managing KV cache in fixed-size pages — exactly like virtual memory in an operating system. Each page holds KV vectors for a fixed number of tokens (e.g., 16). Pages are allocated on demand and do not need to be contiguous. When multiple requests share a common prefix (like a system prompt), they can share the same physical pages. vLLM with PagedAttention achieves 2-4x higher throughput than traditional serving systems.

Traditional: pre-allocate max length memory (60-80% wasted). PagedAttention: allocate small fixed-size pages on demand (near-zero waste).

Key Takeaways

1

KV Cache stores previously computed Key and Value matrices, avoiding redundant recomputation at each decoding step.

2

Without cache, inference is O(n²). With cache, it becomes O(n) per step — hundreds of times faster for long sequences.

3

MQA shares K and V across all heads, reducing cache size by the number of heads (8-32x). GQA is the balanced middle ground used in Llama 2/3.

4

PagedAttention (vLLM) manages KV cache memory like virtual memory, reducing fragmentation from 60-80% to under 4%.

5

Every production LLM serving system uses KV Cache. It is not optional — it is the difference between usable and unusable inference speed.

Explore Related Topics

Continue your journey through LLM fundamentals: