VRAM & KV-cache Calculator

Plan your inference deployment. Estimate weights, KV cache and overhead for any open-weight model, then check whether it fits your GPU — before you rent it.

Total VRAM required: 130 GB (fits on 8× RTX 4090)
Model weights: 32.9 GB
KV cache (32,768 ctx × 8): 80.0 GB
Overhead (15%): 16.9 GB

Per-GPU load: 16.2 GB across 8 GPUs (tensor parallel).

KV per token: 320.0 KB. Halve it with fp8 KV cache.

Estimates use standard GQA formulas (weights = params × bytes; KV = 2 × layers × kv_heads × head_dim × kv_bytes × ctx × batch). Accurate within ~10–20%; MoE and MLA architectures differ.

How the VRAM math works

Three things occupy GPU memory during LLM inference: the model weights, the KV cache that stores attention state for every token in flight, and a buffer of overhead for activations and CUDA graphs. Get any one wrong and you either crash with OOM or pay for a GPU twice the size you needed.

Weights

Weight memory is parameter count × bytes per parameter. Quantization is the biggest lever: a 70B model needs ~141 GB (about 131 GiB) in fp16 but only ~35 GB (about 33 GiB) in int4 (AWQ/GPTQ).
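As a rough sketch of the weight formula, the snippet below multiplies a parameter count by bytes per parameter for three common formats. The 70.6e9 parameter count is an assumption chosen to reproduce the ~141 GB figure, and the helper name weight_bytes is illustrative only. Real int4 checkpoints (AWQ/GPTQ) also store scales and zero points, so they land slightly above 0.5 bytes per parameter.

```python
# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_bytes(num_params: float, dtype: str) -> float:
    """Weight memory in bytes: parameter count x bytes per parameter."""
    return num_params * BYTES_PER_PARAM[dtype]


# Assumed parameter count for a typical "70B" checkpoint.
params = 70.6e9

for dtype in ("fp16", "fp8", "int4"):
    size = weight_bytes(params, dtype)
    # Report both decimal GB (1e9 bytes) and binary GiB (2**30 bytes),
    # since the two differ by about 7% at this scale.
    print(f"{dtype}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

Printing both units helps when comparing against tools that report GiB but label it GB.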

KV cache

The KV cache is the part most people underestimate. It scales with both context length and the number of concurrent sequences. With grouped-query attention (GQA), the number of KV heads — not query heads — drives the size, which is why modern models keep KV heads low.
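To make the GQA effect concrete, here is a minimal sketch of the per-token KV formula. The model shape (80 layers, 8 KV heads, 64 query heads, head_dim 128) is an assumption for a typical 70B decoder, picked because it reproduces the 320 KB per token shown by the calculator. The function names are illustrative.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       kv_bytes: float = 2.0) -> float:
    """KV cache bytes for one token of one sequence.

    The factor 2 covers the key and the value vector stored per layer
    and per KV head. kv_bytes is the element size (2 for fp16).
    """
    return 2 * layers * kv_heads * head_dim * kv_bytes


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   ctx: int, batch: int, kv_bytes: float = 2.0) -> float:
    """Total KV cache: per-token size x context length x sequences."""
    return kv_bytes_per_token(layers, kv_heads, head_dim, kv_bytes) * ctx * batch


# Assumed shape: 80 layers, head_dim 128, 8 KV heads shared by 64 query heads.
gqa = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
# Same model if every query head kept its own KV head (no GQA).
mha = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)

print(f"GQA: {gqa / 1024:.1f} KiB per token")
print(f"Without GQA: {mha / 1024:.1f} KiB per token ({mha / gqa:.0f}x larger)")
print(f"32,768 ctx x 8 sequences: {kv_cache_bytes(80, 8, 128, 32_768, 8) / 2**30:.1f} GiB")
```

Under these assumptions, sharing 8 KV heads across 64 query heads cuts the cache by a factor of 8.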

Frequently asked questions

How do I calculate VRAM for serving an LLM?

Total VRAM = model weights + KV cache + overhead. Weights = parameters × bytes-per-parameter (2 for fp16, 1 for fp8, 0.5 for int4). KV cache = 2 × layers × kv_heads × head_dim × kv_bytes × context_length × concurrent_sequences. Add 10–20% overhead for activations and CUDA graphs.
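The three terms can be combined in a few lines. This is a sketch, not the calculator's actual source: it assumes overhead is taken as a fraction of weights plus KV cache, that sizes are reported in GiB (2**30 bytes), and that the model is an int4 70.6e9-parameter decoder with 80 layers, 8 KV heads and head_dim 128. Those assumptions happen to reproduce the example readout at the top of the page.

```python
from dataclasses import dataclass

GIB = 2**30


@dataclass
class Estimate:
    weights_gib: float
    kv_gib: float
    overhead_gib: float

    @property
    def total_gib(self) -> float:
        return self.weights_gib + self.kv_gib + self.overhead_gib


def estimate_vram(params: float, weight_bytes_per_param: float,
                  layers: int, kv_heads: int, head_dim: int,
                  kv_bytes: float, ctx: int, batch: int,
                  overhead_frac: float = 0.15) -> Estimate:
    """Total VRAM = weights + KV cache + overhead (all in GiB)."""
    weights = params * weight_bytes_per_param / GIB
    kv = 2 * layers * kv_heads * head_dim * kv_bytes * ctx * batch / GIB
    # Assumption: overhead is a flat fraction of weights + KV cache.
    overhead = overhead_frac * (weights + kv)
    return Estimate(weights, kv, overhead)


# Assumed int4 70B-class model, fp16 KV cache, 32,768 context, 8 sequences.
est = estimate_vram(params=70.6e9, weight_bytes_per_param=0.5,
                    layers=80, kv_heads=8, head_dim=128,
                    kv_bytes=2.0, ctx=32_768, batch=8)

print(f"weights  {est.weights_gib:.1f} GiB")
print(f"KV cache {est.kv_gib:.1f} GiB")
print(f"overhead {est.overhead_gib:.1f} GiB")
print(f"total    {est.total_gib:.0f} GiB")
```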

Why does context length affect VRAM so much?

The KV cache grows linearly with context length AND with the number of concurrent sequences. A 70B model at 32k context with an fp16 KV cache can need ~10 GB of KV cache per sequence (32,768 tokens × 320 KB per token). At a batch of 8 that is ~80 GB, often more than the weights themselves. This is why fp8 KV cache and prefix caching matter so much in production.
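A short sketch of that scaling, assuming 320 KiB of KV cache per token (the fp16 figure for the 70B example) and int4 weights of about 32.9 GiB. The helper batch_where_kv_exceeds_weights is a hypothetical name; it finds the smallest batch size at which the KV cache outgrows the weights.

```python
import math

GIB = 2**30
# Assumed per-token KV size: 320 KiB (fp16 KV cache, 70B-class GQA model).
KV_BYTES_PER_TOKEN = 320 * 1024


def kv_gib(ctx: int, batch: int) -> float:
    """KV cache in GiB; linear in both context length and batch size."""
    return KV_BYTES_PER_TOKEN * ctx * batch / GIB


def batch_where_kv_exceeds_weights(weights_gib: float, ctx: int) -> int:
    """Smallest number of concurrent sequences whose KV cache is
    strictly larger than the weights."""
    per_sequence = kv_gib(ctx, 1)
    return math.floor(weights_gib / per_sequence) + 1


print(f"1 sequence at 32k:  {kv_gib(32_768, 1):.1f} GiB")
print(f"8 sequences at 32k: {kv_gib(32_768, 8):.1f} GiB")
print("KV cache passes 32.9 GiB of weights at batch",
      batch_where_kv_exceeds_weights(32.9, 32_768))
```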

Does quantization reduce KV cache memory?

Weight quantization (AWQ, GPTQ, fp8) shrinks the weights but not the KV cache. To shrink the KV cache you need a separate fp8 KV cache setting (--kv-cache-dtype fp8 in vLLM), which roughly halves it at a small accuracy cost.
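The two settings are independent, which the arithmetic makes visible. The sketch below varies weight precision and KV precision separately for an assumed 70.6e9-parameter model (80 layers, 8 KV heads, head_dim 128) at 32,768 context and 8 sequences. It only does the memory math and does not call vLLM.

```python
GIB = 2**30


def components_gib(weight_bytes_per_param: float, kv_bytes_per_elem: float,
                   params: float = 70.6e9, layers: int = 80,
                   kv_heads: int = 8, head_dim: int = 128,
                   ctx: int = 32_768, batch: int = 8) -> tuple[float, float]:
    """Return (weights, KV cache) in GiB for the assumed model shape."""
    weights = params * weight_bytes_per_param / GIB
    kv = 2 * layers * kv_heads * head_dim * kv_bytes_per_elem * ctx * batch / GIB
    return weights, kv


configs = [
    ("fp16 weights, fp16 KV", 2.0, 2.0),
    ("int4 weights, fp16 KV", 0.5, 2.0),  # weights shrink, KV unchanged
    ("int4 weights, fp8 KV", 0.5, 1.0),   # KV halves only here
]

for label, w_bytes, kv_bytes in configs:
    weights, kv = components_gib(w_bytes, kv_bytes)
    print(f"{label}: weights {weights:.1f} GiB, KV {kv:.1f} GiB")
```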

What does tensor parallel do to the memory math?

Tensor parallelism splits both weights and KV cache across GPUs, so per-GPU memory is roughly total ÷ number of GPUs. The calculator finds the smallest power-of-two GPU count that fits your target card.
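One plausible way to implement that search is sketched below; it is not taken from the calculator's source. It assumes memory divides evenly across GPUs, treats the card capacity as fully usable, and caps the search at a hypothetical max_gpus of 64. In practice some overhead is replicated on every GPU, so the real per-GPU load is somewhat higher than total ÷ GPUs.

```python
from typing import Optional


def smallest_pow2_gpus(total_gib: float, card_gib: float,
                       max_gpus: int = 64) -> Optional[int]:
    """Smallest power-of-two GPU count whose even share fits on one card.

    Returns None if even max_gpus cards are not enough.
    """
    n = 1
    while n <= max_gpus:
        if total_gib / n <= card_gib:
            return n
        n *= 2  # tensor-parallel sizes are tried as 1, 2, 4, 8, ...
    return None


total = 130.0  # total estimate from the example readout
card = 24.0    # RTX 4090 memory capacity

gpus = smallest_pow2_gpus(total, card)
print(f"{gpus} GPUs, about {total / gpus:.2f} per GPU")
```

With these inputs, 4 GPUs would leave 32.5 per card, which does not fit in 24, so the search moves on to 8.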

How accurate are these estimates?

Within ~10–20% for standard GQA decoder models. Two kinds of model will not match these formulas exactly: MoE models, where only the active experts count toward compute but all experts occupy memory, and MLA architectures (DeepSeek), which compress the KV cache differently. Always validate against a real vllm serve run.