Question 1

How do I calculate VRAM for serving an LLM?

Accepted Answer

Total VRAM = model weights + KV cache + overhead. Weights = parameters × bytes per parameter (2 for fp16, 1 for fp8, 0.5 for int4). KV cache = 2 × layers × kv_heads × head_dim × kv_bytes × context_length × concurrent_sequences, where the leading 2 counts keys and values and kv_bytes is the bytes per cached element (2 for fp16, 1 for fp8). Add 10–20% overhead for activations and CUDA graphs.
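As a rough illustration, the formula can be sketched in a few lines of Python. The function names, the 15% default overhead (a midpoint of the 10–20% range), the choice to apply the overhead to weights plus KV cache, and the 8B-class model shape in the usage example are all assumptions made for this sketch.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_bytes(params: float, bytes_per_param: float) -> float:
    """Weights = parameters x bytes per parameter (2 fp16, 1 fp8, 0.5 int4)."""
    return params * bytes_per_param


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, kv_bytes: float,
                   context_length: int, concurrent_sequences: int) -> float:
    """KV cache = 2 (keys and values) x layers x kv_heads x head_dim
    x kv_bytes x context_length x concurrent_sequences."""
    return (2 * layers * kv_heads * head_dim * kv_bytes
            * context_length * concurrent_sequences)


def total_vram_bytes(weights: float, kv_cache: float,
                     overhead_fraction: float = 0.15) -> float:
    """Assumption: overhead is modeled as a flat fraction of weights + KV cache."""
    return (weights + kv_cache) * (1 + overhead_fraction)


# Illustrative shape only: 8B parameters in fp16, 32 layers, 8 KV heads,
# head_dim 128, fp16 KV cache, 8k context, 4 concurrent sequences.
weights = weight_bytes(8e9, 2)
kv = kv_cache_bytes(32, 8, 128, 2, 8192, 4)
total = total_vram_bytes(weights, kv)
print(f"weights {weights / GIB:.1f} GiB, KV cache {kv / GIB:.1f} GiB, "
      f"total {total / GIB:.1f} GiB")
```

Keeping weights, KV cache, and overhead as separate terms makes it easy to see which one dominates for a given workload.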

Question 2

Why does context length affect VRAM so much?

Accepted Answer

The KV cache grows linearly with both context length and the number of concurrent sequences. A 70B model at 32k context can need ~10 GB of KV cache per sequence (for example, with 80 layers, 8 KV heads, head dimension 128, and an fp16 cache). At a batch of 8 that is ~80 GB, which is more than the weights themselves once they are quantized to fp8 (~70 GB) or int4 (~35 GB). This is why fp8 KV cache and prefix caching matter so much in production.
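The arithmetic behind those figures can be checked with a short worked example. It assumes a 70B model shaped like a typical GQA decoder (80 layers, 8 KV heads, head dimension 128) with an fp16 KV cache; those shape values are assumptions for illustration, so substitute the values from your model's config.

```python
GIB = 1024 ** 3  # bytes per GiB

# Assumed model shape for illustration (typical of a 70B GQA decoder).
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES = 2                 # fp16 KV cache
CONTEXT_LENGTH = 32 * 1024   # 32k tokens
BATCH = 8                    # concurrent sequences
PARAMS = 70e9

# Bytes of KV cache per token: keys and values for every layer and KV head.
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES
kv_per_sequence = kv_per_token * CONTEXT_LENGTH
kv_batch = kv_per_sequence * BATCH

print(f"KV cache per token:    {kv_per_token / 1024:.0f} KiB")
print(f"KV cache per sequence: {kv_per_sequence / GIB:.0f} GiB")
print(f"KV cache at batch {BATCH}:   {kv_batch / GIB:.0f} GiB")

# Compare against the weights at each precision.
for name, bytes_per_param in (("fp16", 2), ("fp8", 1), ("int4", 0.5)):
    weights = PARAMS * bytes_per_param
    relation = "larger" if kv_batch > weights else "smaller"
    print(f"{name} weights: {weights / 1e9:.0f} GB, "
          f"KV cache at batch {BATCH} is {relation}")
```

Under these assumptions each sequence needs 10 GiB of KV cache, so doubling either the context length or the batch size doubles the total.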

Question 3

Does quantization reduce KV cache memory?

Accepted Answer

Weight quantization (AWQ, GPTQ, fp8) shrinks the weights but not the KV cache. To shrink the KV cache you need a separate fp8 KV cache setting (--kv-cache-dtype fp8 in vLLM), which roughly halves it at a small accuracy cost.
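A small sketch can make the separation concrete: the weight precision and the KV cache precision are independent knobs in the memory math. The `memory_breakdown` helper and the model shape below are hypothetical and used only for illustration; the sketch does the arithmetic and does not call vLLM.

```python
GIB = 1024 ** 3  # bytes per GiB

# Assumed shape for illustration: 70B parameters, 80 layers, 8 KV heads,
# head_dim 128, 32k context, 8 concurrent sequences.
SHAPE = dict(params=70e9, layers=80, kv_heads=8, head_dim=128,
             context_length=32 * 1024, sequences=8)


def memory_breakdown(weight_bytes_per_param: float, kv_bytes: float,
                     shape: dict = SHAPE) -> tuple[float, float]:
    """Return (weight bytes, KV cache bytes) for the given precisions."""
    weights = shape["params"] * weight_bytes_per_param
    kv_cache = (2 * shape["layers"] * shape["kv_heads"] * shape["head_dim"]
                * kv_bytes * shape["context_length"] * shape["sequences"])
    return weights, kv_cache


configs = {
    "fp16 weights, fp16 KV": (2, 2),
    "int4 weights, fp16 KV": (0.5, 2),  # weight quantization only
    "int4 weights, fp8 KV": (0.5, 1),   # plus a separate fp8 KV cache setting
}
for label, (w_bytes, kv_bytes) in configs.items():
    weights, kv_cache = memory_breakdown(w_bytes, kv_bytes)
    print(f"{label}: weights {weights / GIB:.1f} GiB, "
          f"KV cache {kv_cache / GIB:.1f} GiB")
```

Going from the first configuration to the second changes only the weights term; only the third, with the smaller KV element size, changes the KV cache term.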

Question 4

What does tensor parallel do to the memory math?

Accepted Answer

Tensor parallelism splits both weights and KV cache across GPUs, so per-GPU memory is roughly total ÷ number of GPUs. The calculator finds the smallest power-of-two GPU count that fits your target card.
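The search for a GPU count can be sketched as a simple doubling loop. This is not the calculator's actual code: the function name, the `max_gpus` cap, and the `usable_fraction` parameter (the share of each card assumed to be available to the model) are assumptions for this example.

```python
from typing import Optional


def min_tensor_parallel_gpus(total_bytes: float, gpu_bytes: float,
                             usable_fraction: float = 0.9,
                             max_gpus: int = 64) -> Optional[int]:
    """Smallest power-of-two GPU count whose per-GPU share fits the card.

    Assumes memory divides evenly across GPUs (total / n), which is only
    approximately true in practice.
    """
    budget = gpu_bytes * usable_fraction
    n = 1
    while n <= max_gpus:
        if total_bytes / n <= budget:
            return n
        n *= 2  # tensor parallel sizes are tried as 1, 2, 4, 8, ...
    return None  # does not fit within max_gpus


# Example: 180 GB total (weights + KV cache + overhead) on 80 GB cards.
print(min_tensor_parallel_gpus(180e9, 80e9))
```

With 90% of each 80 GB card treated as usable, the 180 GB example does not fit on 1 or 2 GPUs (180 GB and 90 GB per GPU) but does fit on 4 (45 GB per GPU).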

Question 5

How accurate are these estimates?

Accepted Answer

Within ~10–20% for standard GQA decoder models. Two cases will not match these formulas exactly: MoE models, where only the active experts matter for compute but all experts occupy memory, and MLA architectures (DeepSeek), which compress the KV cache differently. Always validate against a real vllm serve run.

VRAM & KV-cache Calculator