vLLM in 2026: the complete production setup guide

Install, serve, benchmark and tune vLLM for production inference — with a fully reproducible config and real TTFT/throughput numbers on an RTX 4090.

This is the production setup guide we wish existed when we started: how to install, serve, benchmark and tune vLLM, with every number traceable to a config you can run yourself. For attention-mechanism internals, Sebastian Raschka’s visual guide is the best resource. Here we cover what happens when you need to serve a model to thousands of concurrent users.

The benchmark environment

Every benchmark we publish ships with a complete environment spec. No spec, no trust.

| Component | Value |
| --- | --- |
| GPU | NVIDIA RTX 4090, 24 GB GDDR6X |
| Driver / CUDA | 560.35 / 12.6 |
| vLLM | 0.8.x |
| Model | meta-llama/Llama-3.3-70B-Instruct |
| Quantization | AWQ (4-bit) |
| Context | 32,768 tokens |
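
A spec like this can be sanity-checked by estimating its memory budget before launching anything. The sketch below is a back-of-the-envelope calculation, not a vLLM API. It assumes the published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) and an FP16 KV cache, treats 24 GB as 24e9 bytes, and ignores quantization scales, activations and CUDA overhead. All function names are illustrative.

```python
# Rough memory estimate; every name and constant here is an illustrative assumption.
GB = 1e9

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage, ignoring scales, zero-points and unquantized layers."""
    return n_params * bits_per_weight / 8

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of K and V cache for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

weights = weight_bytes(70e9, 4)              # 70B parameters at 4 bits
kv_token = kv_bytes_per_token(80, 8, 128)    # assumed Llama 3 70B shape
kv_full_context = kv_token * 32_768          # one sequence at the full context length
budget = 24 * GB * 0.92                      # matches --gpu-memory-utilization 0.92

print(f"weights       ~{weights / GB:.1f} GB")
print(f"KV per token   {kv_token / 1024:.0f} KiB")
print(f"KV at 32k     ~{kv_full_context / GB:.1f} GB")
print(f"GPU budget    ~{budget / GB:.1f} GB")
```

Under these assumptions the 4-bit weights alone come to about 35 GB, which exceeds the roughly 22 GB budget of a single 24 GB card. It is worth confirming how the config fits the model (offloading, multiple GPUs, or a smaller model) before trusting any numbers.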

Serve command

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --quantization awq \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --max-num-seqs 8 \
  --max-model-len 32768

The flags that move the needle most:

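  • --quantization awq: loads 4-bit AWQ weights, cutting weight memory to roughly a quarter of FP16 at a small quality cost. It expects an AWQ-quantized checkpoint. Note that 70B parameters at 4 bits is still about 35 GB of weights, more than one 24 GB card holds, so running this model on a single RTX 4090 also requires something like --cpu-offload-gb, or several GPUs with --tensor-parallel-size.
  • --enable-prefix-caching: reuses KV cache for shared prefixes (system prompts, few-shot examples). This is a nearly free latency win for workloads that repeat the same prefix.
  • --max-num-seqs: caps concurrent sequences; raise it for throughput, lower it for predictable latency.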
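
To make the prefix-caching bullet concrete: vLLM manages KV cache in fixed-size blocks, so only whole blocks of a shared prefix can be reused. The sketch below is a toy model of that idea, not vLLM's implementation. The 16-token block size, the function name and the token IDs are assumptions for illustration.

```python
def reusable_prefix_tokens(cached: list[int], incoming: list[int], block_size: int = 16) -> int:
    """Count tokens of `incoming` that a cached prompt could cover.

    Only complete blocks of the common prefix count, mirroring block-granular caching.
    """
    common = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        common += 1
    return (common // block_size) * block_size

# Hypothetical token IDs: a 100-token system prompt followed by different user turns.
system_prompt = list(range(100))
request_a = system_prompt + [1001, 1002, 1003]
request_b = system_prompt + [2001, 2002]

print(reusable_prefix_tokens(request_a, request_b))  # 96: six full 16-token blocks
```

The last four tokens of the system prompt fall in a partial block and are recomputed, which is why long, stable prefixes benefit most.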

Benchmark command

vllm bench serve \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 200 \
  --request-rate 4 \
  --save-result \
  --result-filename benchmark_results.json
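
These flags fix the offered load, which is worth computing before reading any results. A minimal sketch of that arithmetic, assuming requests arrive at exactly the configured rate (the benchmark tool may randomize arrivals around it); the function name is illustrative:

```python
def offered_load(num_prompts: int, request_rate: float, input_len: int, output_len: int) -> dict:
    """Derive the load implied by the benchmark flags."""
    return {
        "arrival_window_s": num_prompts / request_rate,
        "input_tok_per_s": request_rate * input_len,
        "output_tok_per_s": request_rate * output_len,
        "total_tokens": num_prompts * (input_len + output_len),
    }

load = offered_load(num_prompts=200, request_rate=4, input_len=1024, output_len=512)
for name, value in load.items():
    print(f"{name}: {value:,.0f}")
```

If the server cannot sustain about 2,048 output tokens per second, requests queue up and TTFT grows over the run, so the request rate should be chosen with the server's capacity in mind.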

Results

| Metric | Value |
| --- | --- |
| TTFT (p50) | 142 ms |
| TTFT (p99) | 410 ms |
| TPOT (p50) | 28 ms/token |
| Throughput | ~1,950 tok/s aggregate |
| GPU utilization | 94% |

Numbers are illustrative placeholders for this scaffold. Real runs and raw JSON live in the public benchmark repo and are reproducible from the config above.
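
Before real numbers replace the placeholders, a quick internal-consistency check is worthwhile. One rough bound: with at most N sequences decoding at once and a per-token latency of TPOT, output throughput cannot exceed about N / TPOT. The sketch below applies that bound. Treating the p50 TPOT as representative is an assumption, and the function name is illustrative.

```python
def max_output_throughput(max_num_seqs: int, tpot_ms: float) -> float:
    """Rough ceiling on output tokens/s: each running sequence emits one token per TPOT."""
    return max_num_seqs / (tpot_ms / 1000.0)

# --max-num-seqs 8 from the serve command, TPOT (p50) from the results table.
ceiling = max_output_throughput(max_num_seqs=8, tpot_ms=28)
print(f"output ceiling ~{ceiling:.0f} tok/s")
```

With the placeholder values this gives roughly 286 output tok/s, far below the ~1,950 tok/s listed even if prompt tokens are counted in the aggregate. That is the kind of mismatch this check is meant to catch in a real run.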

Reproduce it

  1. Clone the benchmark repo (config + README).
  2. Match the environment spec table exactly (driver, CUDA, vLLM, model revision).
  3. Run the serve command, then the bench command.
  4. Compare your benchmark_results.json against ours.
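
Step 4 can be automated. The sketch below flags metrics that differ by more than a relative tolerance. The metric keys are assumed names and should be checked against your own benchmark_results.json; the 10% tolerance and the sample values are arbitrary.

```python
import json
from pathlib import Path

# Assumed metric keys; adjust to whatever your result file actually contains.
METRICS = ("median_ttft_ms", "p99_ttft_ms", "median_tpot_ms", "output_throughput")

def load_results(path: str) -> dict:
    """Read one saved benchmark result file."""
    return json.loads(Path(path).read_text())

def compare_results(ours: dict, yours: dict, rel_tol: float = 0.10) -> dict:
    """Return {metric: relative_difference} for metrics that differ by more than rel_tol."""
    drift = {}
    for key in METRICS:
        if key not in ours or key not in yours:
            continue  # skip metrics missing from either file
        ref, got = float(ours[key]), float(yours[key])
        if ref == 0:
            rel = 0.0 if got == 0 else float("inf")
        else:
            rel = abs(got - ref) / abs(ref)
        if rel > rel_tol:
            drift[key] = rel
    return drift

# Hypothetical values standing in for two loaded result files.
ours = {"median_ttft_ms": 142.0, "median_tpot_ms": 28.0}
yours = {"median_ttft_ms": 171.0, "median_tpot_ms": 28.5}

for metric, rel in compare_results(ours, yours).items():
    print(f"{metric}: differs by {rel:.0%}")
```

In practice both dictionaries would come from load_results, one for the published file and one for your own run. A relative tolerance suits latency metrics better than an absolute one because run-to-run noise scales with the value.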

This reproducibility standard is the moat. If you can’t reproduce it, it isn’t a benchmark — it’s marketing.
