Production AI engineering

Stop guessing your AI stack.

Set your use case and constraints below. Get ranked, good-fit LLM architectures — real models, real providers, real monthly costs — updating live as you decide.

291 curated models · Live pricing (models.dev) · Managed vs self-host · Reproducible numbers

What are you building?

Constraints

Requests / day: 50,000

Peak concurrency: 8

Latency requirement

Data boundary

Existing GPU (optional)

Monthly budget USD (optional)

Optimize for

10 good-fit options · ~3900M tokens/mo · updates live
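The page does not state how many tokens it assumes per request, but the two figures shown (50,000 requests/day and ~3900M tokens/mo) let you back it out. A minimal sketch, assuming a 30-day month; the function name and the 30-day default are illustrative, not taken from the tool:

```python
def tokens_per_request(tokens_per_month_m: float,
                       requests_per_day: int,
                       days_per_month: int = 30) -> float:
    """Average tokens per request implied by a monthly token volume.

    tokens_per_month_m is in millions of tokens; days_per_month=30 is an
    assumption, since the page does not say which month length it uses.
    """
    requests_per_month = requests_per_day * days_per_month
    return tokens_per_month_m * 1_000_000 / requests_per_month


# Figures from the summary line: ~3900M tokens/mo at 50,000 requests/day.
print(tokens_per_request(3900, 50_000))
```

Under these assumptions the implied average is about 2,600 tokens per request (input plus output), which is a useful sanity check against your own prompt and context sizes.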

Sort · How we rank & our sources

Llama 3.2 1B Instruct

Meta·open weights·128k ctx·quality 71*·released 2024-09·knowledge 2023-12

IaaS · rent GPU · 1× RTX 3090 · Best overall

$161/mo · $0.04/M tok

HH 83 · cost 92 · latency 95 · quality 68

• Self-host Llama 3.2 1B Instruct on 1× RTX 3090 (~2 GB).
• ~$161/mo dedicated · est. 457 tok/s decode.
• Scored for RAG Q&A (interactive): cost 92 · latency 95 · quality 68.
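For a self-hosted option the monthly bill is a fixed rental, so the per-token price on the card is derived rather than quoted. A rough sketch of that arithmetic, assuming the ~3900M tokens/mo volume from the summary line; the function name is illustrative:

```python
def effective_price_per_m(monthly_cost_usd: float,
                          tokens_per_month_m: float) -> float:
    """Effective USD per million tokens for a fixed monthly cost."""
    return monthly_cost_usd / tokens_per_month_m


# $161/mo dedicated GPU spread over ~3900M tokens/mo.
print(round(effective_price_per_m(161, 3900), 2))

# Same arithmetic for a $321/mo two-GPU rental.
print(round(effective_price_per_m(321, 3900), 2))
```

Because the rental is fixed, the effective price falls as volume rises and climbs if traffic drops, which is the main difference from per-token managed pricing.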

Llama 3.1 8B

Meta·open weights·32k ctx·quality 71*·released 2025-01·knowledge 2023-12

Managed API

$390/mo · $0.10/M tok

HH 81 · cost 80 · latency 97 · quality 71

• Cerebras serves Llama 3.1 8B as a managed API — no infra to run.
• ~$390/mo at your volume · $0.10/M tokens blended.
• Scored for RAG Q&A (interactive): cost 80 · latency 97 · quality 71.
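The "blended" price above folds input and output token prices into one number. The page does not show the split it uses, so the sketch below is only one plausible way to compute it: a weighted average by input share. The input/output prices and the 90% input share are hypothetical values, chosen only so the blend lands on $0.10/M; they are not the provider's list prices.

```python
def blended_price(price_in: float, price_out: float, input_share: float) -> float:
    """Weighted average USD per million tokens for a given input-token share."""
    return input_share * price_in + (1 - input_share) * price_out


def monthly_cost(tokens_per_month_m: float, price_per_m: float) -> float:
    """Monthly USD cost for a token volume given in millions of tokens."""
    return tokens_per_month_m * price_per_m


# Hypothetical prices and split (assumptions, not list prices).
price = blended_price(price_in=0.05, price_out=0.55, input_share=0.9)
print(f"blended: ${price:.2f}/M tok")

# ~3900M tokens/mo at $0.10/M blended, as shown on the card.
print(f"monthly: ${monthly_cost(3900, 0.10):.0f}")
```

RAG workloads tend to be input-heavy, so the input share has a large effect on the blended figure; it is worth recomputing with your own ratio.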

⚠ Data leaves your boundary — verify the provider DPA/compliance.

Price across 1 provider · your volume · list price

Cerebras · API · fast · cheapest · $390/mo

Qwen3 30B A3B

Alibaba·open weights·41k ctx·quality 77*·released 2025-04

IaaS · rent GPU · 2× RTX 3090

$321/mo · $0.08/M tok

HH 80 · cost 83 · latency 83 · quality 77

• Self-host Qwen3 30B A3B on 2× RTX 3090 (~36 GB, tensor parallel).
• ~$321/mo dedicated · est. 340 tok/s decode.
• Scored for RAG Q&A (interactive): cost 83 · latency 83 · quality 77.

⚠ MoE: all experts occupy memory even though only some compute.
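The MoE warning can be made concrete with a back-of-the-envelope weight-memory estimate. This is a sketch, not the tool's VRAM math: it takes the 30B total / 3B active split from the model name, assumes 1 byte per parameter (8-bit weights), and ignores KV cache and runtime overhead, which the card's ~36 GB figure presumably includes.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1e9 params x bytes / 1e9 bytes per GB)."""
    return params_billion * bytes_per_param


TOTAL_PARAMS_B = 30    # all experts must be resident in VRAM
ACTIVE_PARAMS_B = 3    # parameters used per token (the "A3B" in the name)
BYTES_PER_PARAM = 1.0  # assumption: 8-bit weights

resident = weight_memory_gb(TOTAL_PARAMS_B, BYTES_PER_PARAM)
active = weight_memory_gb(ACTIVE_PARAMS_B, BYTES_PER_PARAM)
print(f"resident weights: ~{resident:.0f} GB, touched per token: ~{active:.0f} GB")
```

The GPU count is sized by the resident figure, while decode speed tracks the much smaller active figure, which is why an MoE model can need two cards yet still decode quickly.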

Llama 3.1 8B

Meta·open weights·16k ctx·quality 71*·released 2025-01·knowledge 2023-12

Managed API · Cheapest

$84/mo · $0.02/M tok

HH 80 · cost 100 · latency 75 · quality 71

• DeepInfra is the cheapest of 2 providers serving Llama 3.1 8B — full price stack below.
• ~$84/mo at your volume · $0.02/M tokens blended.
• Scored for RAG Q&A (interactive): cost 100 · latency 75 · quality 71.

⚠ Data leaves your boundary — verify the provider DPA/compliance. Listed cost is per-token list price; committed-use, batch and enterprise contracts differ.

Price across 2 providers · your volume · list price

DeepInfra · API · medium · cheapest · $84/mo
Novita AI · API · medium · $96/mo · 1.1×

Qwen2.5 7B Instruct

Alibaba·open weights·131k ctx·quality 78*·released 2025-04·knowledge 2024-04

Managed API

$195/mo · $0.05/M tok

HH 80 · cost 89 · latency 75 · quality 78

• SiliconFlow is the cheapest of 3 providers serving Qwen2.5 7B Instruct — full price stack below.
• ~$195/mo at your volume · $0.05/M tokens blended.
• Scored for RAG Q&A (interactive): cost 89 · latency 75 · quality 78.

⚠ Data leaves your boundary — verify the provider DPA/compliance. Listed cost is per-token list price; committed-use, batch and enterprise contracts differ.

Price across 3 providers · your volume · list price

SiliconFlow · API · medium · cheapest · $195/mo
Novita AI · API · medium · $273/mo · 1.4×
Alibaba (Qwen) · API · medium · $998/mo · 5.1×
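The multiplier at the end of each provider row appears to be that provider's monthly cost divided by the cheapest one, rounded to one decimal. A small sketch that reproduces the figures listed for this model; the function name is illustrative:

```python
def price_multiples(monthly_costs: dict[str, float]) -> dict[str, float]:
    """Each provider's monthly cost as a multiple of the cheapest provider."""
    cheapest = min(monthly_costs.values())
    return {name: round(cost / cheapest, 1) for name, cost in monthly_costs.items()}


# Monthly list prices shown for Qwen2.5 7B Instruct at this volume.
costs = {"SiliconFlow": 195, "Novita AI": 273, "Alibaba (Qwen)": 998}

for provider, multiple in price_multiples(costs).items():
    print(f"{provider}: {multiple}×")
```

The same calculation matches the other price stacks on the page, for example $96 against $84 giving 1.1×.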

GPT OSS 20B

OpenAI·open weights·131k ctx·quality 76*·released 2025-08

IaaS · rent GPU · 2× RTX 3090

$321/mo · $0.08/M tok

HH 80 · cost 83 · latency 83 · quality 76

• Self-host GPT OSS 20B on 2× RTX 3090 (~24 GB, tensor parallel).
• ~$321/mo dedicated · est. 312 tok/s decode.
• Scored for RAG Q&A (interactive): cost 83 · latency 83 · quality 76.

⚠ MoE: all experts occupy memory even though only some compute.

THUDM/GLM-4-9B-0414

Zhipu AI·proprietary·33k ctx·quality 82*·released 2025-04

Managed API

$335/mo · $0.09/M tok

HH 79 · cost 82 · latency 72 · quality 82

• SiliconFlow serves THUDM/GLM-4-9B-0414 as a managed API — no infra to run.
• ~$335/mo at your volume · $0.09/M tokens blended.
• Scored for RAG Q&A (interactive): cost 82 · latency 72 · quality 82.

⚠ Data leaves your boundary — verify the provider DPA/compliance.

Price across 1 provider · your volume · list price

SiliconFlow · API · medium · cheapest · $335/mo

GPT OSS 120B

OpenAI·open weights·131k ctx·quality 61·released 2026-01·knowledge 2025-09

Managed API · Fastest

$309/mo · $0.08/M tok

HH 79 · cost 83 · latency 99 · quality 61

• DeepInfra is the cheapest of 10 providers serving GPT OSS 120B — full price stack below.
• ~$309/mo at your volume · $0.08/M tokens blended.
• Scored for RAG Q&A (interactive): cost 83 · latency 99 · quality 61.

⚠ Data leaves your boundary — verify the provider DPA/compliance. Reasoning model: extra reasoning tokens raise real cost and latency. Listed cost is per-token list price; committed-use, batch and enterprise contracts differ.

Price across 10 providers · your volume · list price

DeepInfra · API · medium · cheapest · $309/mo
Novita AI · API · medium · $315/mo · 1.0×
Databricks · hyperscaler · medium · $406/mo · 1.3×
SiliconFlow · API · medium · $435/mo · 1.4×
Baseten · serverless · medium · $630/mo · 2.0×
Nebius · serverless · medium · $630/mo · 2.0×

Step 3.5 Flash

StepFun·open weights·256k ctx·quality 68·released 2026-02

Managed API

$510/mo · $0.13/M tok

HH 78 · cost 77 · latency 93 · quality 68

• SiliconFlow serves Step 3.5 Flash as a managed API — no infra to run.
• ~$510/mo at your volume · $0.13/M tokens blended.
• Scored for RAG Q&A (interactive): cost 77 · latency 93 · quality 68.

⚠ Data leaves your boundary — verify the provider DPA/compliance. Reasoning model: extra reasoning tokens raise real cost and latency.

Price across 1 provider · your volume · list price

SiliconFlow · API · medium · cheapest · $510/mo

Claude Opus 4.8

Anthropic·proprietary·1M ctx·quality 96·released 2026-05

Hyperscaler · Highest quality

$31.5k/mo · $8.08/M tok

HH 58 · cost 24 · latency 35 · quality 96

• Google Vertex is the cheapest of 3 providers serving Claude Opus 4.8 — full price stack below.
• ~$31.5k/mo at your volume · $8.08/M tokens blended.
• Scored for RAG Q&A (interactive): cost 24 · latency 35 · quality 96.

Price across 3 providers · your volume · list price

Google Vertex · hyperscaler · medium · cheapest · $31.5k/mo
Anthropic · API · medium · $31.5k/mo
AWS Bedrock · hyperscaler · medium · $31.5k/mo

Disclosure: some managed-provider links are affiliate links. Ranking is driven only by your inputs and real numbers — never by who pays. A provider appears only when it fits.

We go deep across three lanes

🔩

Silicon

Hardware & serving

GPU selection, VRAM math, quantization tradeoffs, KV-cache tuning and cost-per-token analysis with reproducible numbers.

🧠

Models

Evals, architecture & tradeoffs

Open-weight model landscape, architecture comparisons, and closed-API vs self-hosted decision frameworks with real data.

🔧

Stack

Serving frameworks & orchestration

vLLM, SGLang, TGI and TensorRT-LLM head-to-heads, plus OSS orchestration with Dify, Windmill and n8n.

Latest benchmarks

🔩 Silicon · Reproducible

You don't need an H100: matching GPU workload to hardware

A real diffusion-TTS pipeline case study. Why memory bandwidth — not parameter count — decides your GPU, and how to burst to cloud GPUs for $0.40 a render.

Apr 22, 2026

🔩 Silicon · Reproducible

vLLM in 2026: the complete production setup guide

Install, serve, benchmark and tune vLLM for production inference — with a fully reproducible config and real TTFT/throughput numbers on an RTX 4090.

Apr 15, 2026

Email me this plan + the configs to ship it

Don't want to build it yourself?

Benchmark deep-dives, twice a month.