Production AI engineering

Stop guessing your AI stack.

Set your use case and constraints below. Get ranked, good-fit LLM architectures — real models, real providers, real monthly costs — updating live as you decide.

291 curated models · Live pricing (models.dev) · Managed vs self-host · Reproducible numbers

1. What are you building?

2. Constraints

3. Optimize for

10 good-fit options · ~3900M tokens/mo · updates live
How we rank & our sources

1. Llama 3.2 1B Instruct

Meta · open weights · 128k ctx · quality 71* · released 2024-09 · knowledge 2023-12
IaaS · rent GPU · 1× RTX 3090 · Best overall
$161/mo · $0.04/M tok
HH 83 · cost 92 · latency 95 · quality 68
  • Self-host Llama 3.2 1B Instruct on 1× RTX 3090 (~2 GB).
  • ~$161/mo dedicated · est. 457 tok/s decode.
  • Scored for RAG Q&A (interactive): cost 92 · latency 95 · quality 68.

2. Llama 3.1 8B

Meta · open weights · 32k ctx · quality 71* · released 2025-01 · knowledge 2023-12
Managed API
$390/mo · $0.10/M tok
HH 81 · cost 80 · latency 97 · quality 71
  • Cerebras serves Llama 3.1 8B as a managed API — no infra to run.
  • ~$390/mo at your volume · $0.10/M tokens blended.
  • Scored for RAG Q&A (interactive): cost 80 · latency 97 · quality 71.
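The per-token and monthly figures on each card are two views of the same number: monthly cost divided by monthly volume. A minimal Python sketch, assuming the ~3900M tokens/mo volume shown in the results header and using hypothetical helper names (`blended_cost_per_million`, `monthly_cost`) that are not part of the tool:

```python
def blended_cost_per_million(monthly_usd: float, monthly_tokens_m: float) -> float:
    """Blended $/M tokens for a fixed monthly bill (for example, a rented GPU)."""
    if monthly_tokens_m <= 0:
        raise ValueError("monthly_tokens_m must be positive")
    return monthly_usd / monthly_tokens_m


def monthly_cost(price_per_million_usd: float, monthly_tokens_m: float) -> float:
    """Monthly bill for a per-token price at a given volume."""
    return price_per_million_usd * monthly_tokens_m


VOLUME_M = 3900  # ~3900M tokens/mo, taken from the results header

# Self-hosted card: $161/mo dedicated at the stated volume.
print(round(blended_cost_per_million(161, VOLUME_M), 2))
# Managed card: $0.10/M tokens at the stated volume.
print(round(monthly_cost(0.10, VOLUME_M), 2))
```

At this volume the two calls give roughly $0.04/M and $390/mo, in line with the cards. Because the displayed per-token prices are rounded, multiplying them back out will not always reproduce a card's monthly figure exactly.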

Data leaves your boundary — verify the provider DPA/compliance.

Price across 1 provider · your volume · list price
  • Cerebras · API · fast · cheapest · $390/mo

3. Qwen3 30B A3B

Alibaba · open weights · 41k ctx · quality 77* · released 2025-04
IaaS · rent GPU · 2× RTX 3090
$321/mo · $0.08/M tok
HH 80 · cost 83 · latency 83 · quality 77
  • Self-host Qwen3 30B A3B on 2× RTX 3090 (~36 GB, tensor parallel).
  • ~$321/mo dedicated · est. 340 tok/s decode.
  • Scored for RAG Q&A (interactive): cost 83 · latency 83 · quality 77.

MoE: all experts occupy memory even though only some compute.
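The MoE note can be illustrated with a rough sizing sketch: weight memory scales with the total parameter count, while per-token compute scales with the active parameters only. The parameter counts below are read off the model name (30B total, about 3B active) and the 1 byte per parameter is an assumed 8-bit precision. Neither comes from the tool, and the sketch does not try to reproduce the ~36 GB figure above, which may include overheads beyond the weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_params_b: float   # all experts, billions of parameters (assumed)
    active_params_b: float  # parameters used per token, billions (assumed)


def weight_memory_gb(model: MoEModel, bytes_per_param: float) -> float:
    """Memory for weights alone: every expert must be resident."""
    return model.total_params_b * bytes_per_param


def active_fraction(model: MoEModel) -> float:
    """Share of parameters that participate in computing one token."""
    return model.active_params_b / model.total_params_b


qwen = MoEModel("Qwen3 30B A3B", total_params_b=30.0, active_params_b=3.0)
print(weight_memory_gb(qwen, bytes_per_param=1.0))  # weights only, 8-bit assumption
print(round(active_fraction(qwen), 2))
```

Under these assumptions the GPU has to hold all 30B parameters even though only about a tenth of them take part in any single token, which is why an MoE model can need two cards while decoding quickly.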

4. Llama 3.1 8B

Meta · open weights · 16k ctx · quality 71* · released 2025-01 · knowledge 2023-12
Managed API · Cheapest
$84/mo · $0.02/M tok
HH 80 · cost 100 · latency 75 · quality 71
  • DeepInfra is the cheapest of 2 providers serving Llama 3.1 8B — full price stack below.
  • ~$84/mo at your volume · $0.02/M tokens blended.
  • Scored for RAG Q&A (interactive): cost 100 · latency 75 · quality 71.

Data leaves your boundary — verify the provider DPA/compliance. Listed cost is per-token list price; committed-use, batch and enterprise contracts differ.

Price across 2 providers · your volume · list price
  • DeepInfra · API · medium · cheapest · $84/mo
  • Novita AI · API · medium · $96/mo · 1.1×

5. Qwen2.5 7B Instruct

Alibaba · open weights · 131k ctx · quality 78* · released 2025-04 · knowledge 2024-04
Managed API
$195/mo · $0.05/M tok
HH 80 · cost 89 · latency 75 · quality 78
  • SiliconFlow is the cheapest of 3 providers serving Qwen2.5 7B Instruct — full price stack below.
  • ~$195/mo at your volume · $0.05/M tokens blended.
  • Scored for RAG Q&A (interactive): cost 89 · latency 75 · quality 78.

Data leaves your boundary — verify the provider DPA/compliance. Listed cost is per-token list price; committed-use, batch and enterprise contracts differ.

Price across 3 providers · your volume · list price
  • SiliconFlow · API · medium · cheapest · $195/mo
  • Novita AI · API · medium · $273/mo · 1.4×
  • Alibaba (Qwen) · API · medium · $998/mo · 5.1×
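
The multiplier next to each provider appears to be its monthly price divided by the cheapest provider's price, rounded to one decimal. A small sketch under that assumption, with a hypothetical `price_multiples` helper and the three prices listed above:

```python
def price_multiples(prices: dict[str, float]) -> dict[str, float]:
    """Map each provider to its price relative to the cheapest one."""
    if not prices:
        raise ValueError("prices must not be empty")
    cheapest = min(prices.values())
    return {name: round(price / cheapest, 1) for name, price in prices.items()}


qwen25_prices = {"SiliconFlow": 195, "Novita AI": 273, "Alibaba (Qwen)": 998}
ranked = sorted(price_multiples(qwen25_prices).items(), key=lambda kv: kv[1])
for provider, multiple in ranked:
    print(f"{provider}: {multiple}x")
```

The same calculation is consistent with the multipliers shown on the other cards, for example $96 against $84 rounding to 1.1×.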

6. GPT OSS 20B

OpenAI · open weights · 131k ctx · quality 76* · released 2025-08
IaaS · rent GPU · 2× RTX 3090
$321/mo · $0.08/M tok
HH 80 · cost 83 · latency 83 · quality 76
  • Self-host GPT OSS 20B on 2× RTX 3090 (~24 GB, tensor parallel).
  • ~$321/mo dedicated · est. 312 tok/s decode.
  • Scored for RAG Q&A (interactive): cost 83 · latency 83 · quality 76.

MoE: all experts occupy memory even though only some compute.

7. THUDM/GLM-4-9B-0414

Zhipu AI · proprietary · 33k ctx · quality 82* · released 2025-04
Managed API
$335/mo · $0.09/M tok
HH 79 · cost 82 · latency 72 · quality 82
  • SiliconFlow serves THUDM/GLM-4-9B-0414 as a managed API — no infra to run.
  • ~$335/mo at your volume · $0.09/M tokens blended.
  • Scored for RAG Q&A (interactive): cost 82 · latency 72 · quality 82.

Data leaves your boundary — verify the provider DPA/compliance.

Price across 1 provider · your volume · list price
  • SiliconFlow · API · medium · cheapest · $335/mo

8. GPT OSS 120B

OpenAI · open weights · 131k ctx · quality 61 · released 2026-01 · knowledge 2025-09
Managed API · Fastest
$309/mo · $0.08/M tok
HH 79 · cost 83 · latency 99 · quality 61
  • DeepInfra is the cheapest of 10 providers serving GPT OSS 120B — full price stack below.
  • ~$309/mo at your volume · $0.08/M tokens blended.
  • Scored for RAG Q&A (interactive): cost 83 · latency 99 · quality 61.

Data leaves your boundary — verify the provider DPA/compliance. Reasoning model: extra reasoning tokens raise real cost and latency. Listed cost is per-token list price; committed-use, batch and enterprise contracts differ.

Price across 10 providers · your volume · list price
  • DeepInfra · API · medium · cheapest · $309/mo
  • Novita AI · API · medium · $315/mo · 1.0×
  • Databricks · hyperscaler · medium · $406/mo · 1.3×
  • SiliconFlow · API · medium · $435/mo · 1.4×
  • Baseten · serverless · medium · $630/mo · 2.0×
  • Nebius · serverless · medium · $630/mo · 2.0×

9. Step 3.5 Flash

StepFun · open weights · 256k ctx · quality 68 · released 2026-02
Managed API
$510/mo · $0.13/M tok
HH 78 · cost 77 · latency 93 · quality 68
  • SiliconFlow serves Step 3.5 Flash as a managed API — no infra to run.
  • ~$510/mo at your volume · $0.13/M tokens blended.
  • Scored for RAG Q&A (interactive): cost 77 · latency 93 · quality 68.

Data leaves your boundary — verify the provider DPA/compliance. Reasoning model: extra reasoning tokens raise real cost and latency.

Price across 1 provider · your volume · list price
  • SiliconFlow · API · medium · cheapest · $510/mo

10. Claude Opus 4.8

Anthropic · proprietary · 1M ctx · quality 96 · released 2026-05
Hyperscaler · Highest quality
$31.5k/mo · $8.08/M tok
HH 58 · cost 24 · latency 35 · quality 96
  • Google Vertex is the cheapest of 3 providers serving Claude Opus 4.8 — full price stack below.
  • ~$31.5k/mo at your volume · $8.08/M tokens blended.
  • Scored for RAG Q&A (interactive): cost 24 · latency 35 · quality 96.

Data leaves your boundary — verify the provider DPA/compliance. Reasoning model: extra reasoning tokens raise real cost and latency. Listed cost is per-token list price; committed-use, batch and enterprise contracts differ.

Price across 3 providers · your volume · list price
  • Google Vertex · hyperscaler · medium · cheapest · $31.5k/mo
  • Anthropic · API · medium · $31.5k/mo
  • AWS Bedrock · hyperscaler · medium · $31.5k/mo

Email me this plan + the configs to ship it

Get this recommendation, the serve commands, and a reproducible benchmark config. No spam.

Don't want to build it yourself?

We'll implement this exact architecture (Llama 3.2 1B Instruct on your own GPUs), production-ready and tested, with a fixed-scope review.

Disclosure: some managed-provider links are affiliate links. Ranking is driven only by your inputs and real numbers — never by who pays. A provider appears only when it fits.

We go deep across three lanes

Latest benchmarks

Benchmark deep-dives, twice a month.

Reproducible hardware and serving benchmarks, model decision frameworks, and OSS stack walkthroughs. No hype, no spam.