vLLM in 2026: the complete production setup guide
Install, serve, benchmark and tune vLLM for production inference — with a fully reproducible config and real TTFT/throughput numbers on an RTX 4090.
This is the production setup guide we wish existed when we started: how to install, serve, benchmark and tune vLLM, with every number traceable to a config you can run yourself. For attention-mechanism internals, Sebastian Raschka’s visual guide is the best resource — here we cover what happens when you need to serve that model to thousands of concurrent users.
The benchmark environment
Every benchmark we publish ships with a complete environment spec. No spec, no trust.
| Component | Value |
|---|---|
| GPU | NVIDIA RTX 4090, 24 GB GDDR6X |
| Driver / CUDA | 560.35 / 12.6 |
| vLLM | 0.8.x |
| Model | meta-llama/Llama-3.3-70B-Instruct |
| Quantization | AWQ (4-bit) |
| Context | 32,768 tokens |
Serve command
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--quantization awq \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--max-num-seqs 8 \
--max-model-len 32768
The flags that move the needle most:
--quantization awq— fits a 70B model into 24 GB at a small quality cost.--enable-prefix-caching— reuses KV cache for shared prefixes (system prompts, few-shot examples). Free latency wins for repetitive workloads.--max-num-seqs— caps concurrent sequences; raise it for throughput, lower it for predictable latency.
Benchmark command
vllm bench serve \
--backend openai-chat \
--endpoint /v1/chat/completions \
--model meta-llama/Llama-3.3-70B-Instruct \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 512 \
--num-prompts 200 \
--request-rate 4 \
--save-result \
--result-filename benchmark_results.json
Results
| Metric | Value |
|---|---|
| TTFT (p50) | 142 ms |
| TTFT (p99) | 410 ms |
| TPOT (p50) | 28 ms/token |
| Throughput | ~1,950 tok/s aggregate |
| GPU utilization | 94% |
Numbers are illustrative placeholders for this scaffold. Real runs and raw JSON live in the public benchmark repo and are reproducible from the config above.
Reproduce it
- Clone the benchmark repo (config + README).
- Match the environment spec table exactly (driver, CUDA, vLLM, model revision).
- Run the serve command, then the bench command.
- Compare your
benchmark_results.jsonagainst ours.
This reproducibility standard is the moat. If you can’t reproduce it, it isn’t a benchmark — it’s marketing.