Learn production AI engineering
A structured path from chip to workflow. Three lanes, beginner to advanced, grounded in real numbers — then put it to work in the Advisor.
🔩 Silicon
Hardware & servingGPU selection, VRAM math, quantization tradeoffs, KV-cache tuning and cost-per-token analysis with reproducible numbers.
- 1 Core
Why LLM inference is memory-bound
Decode speed is set by memory bandwidth, not FLOPs — the single fact that explains GPU choice and cost.
- 2 Core
VRAM math: weights + KV cache
Bytes/param by precision, plus KV cache that grows with context and batch. Compute it for any model.
Open → - 3 Core
Quantization tradeoffs
fp16 vs bf16 vs fp8 vs AWQ/GPTQ — what you save in memory and what you pay in quality.
- 4 Applied
KV cache, GQA and MLA
Why context costs memory, and how grouped-query and multi-head latent attention shrink it.
- 5 Applied
Choosing a GPU
VRAM, bandwidth and $/hr across consumer and datacenter cards — matched to your workload.
Open → - 6 Advanced
Throughput, batching & cost/token
Continuous batching, tokens/sec, and turning hardware into a real cost-per-million-tokens number.
🧠 Models
Evals, architecture & tradeoffsOpen-weight model landscape, architecture comparisons, and closed-API vs self-hosted decision frameworks with real data.
- 1 Core
The open-weight landscape
Llama, Qwen, DeepSeek, Mistral, Gemma, GLM, Kimi — who makes what, and the licenses that matter.
- 2 Core
Dense vs Mixture-of-Experts
Total params set memory; active params set compute and quality. Why a 671B MoE can be cheap to run.
- 3 Applied
Reading benchmarks honestly
LMArena Elo, contamination, and what a quality score does and does not tell you.
Open → - 4 Applied
Reasoning models
Hidden reasoning tokens change real cost and latency — when they earn their keep.
- 5 Applied
Managed API vs self-host
The decision framework: data boundary, volume, latency and total cost — not just sticker price.
Open → - 6 Advanced
Capability fit
Context length, tool-calling, vision and code ability — matching the model to the job.
🔧 Stack
Serving frameworks & orchestrationvLLM, SGLang, TGI and TensorRT-LLM head-to-heads, plus OSS orchestration with Dify, Windmill and n8n.
- 1 Core
Serving engines compared
vLLM, SGLang, TGI and TensorRT-LLM — what each is good at and how to benchmark them.
Open → - 2 Core
The OpenAI-compatible layer
The portability trick: one API shape lets you swap managed and self-host with a base-url change.
- 3 Applied
The deployment spectrum
Managed API → Serverless GPU → IaaS (open weights) → DIY, and where the cost crossover sits.
Open → - 4 Applied
RAG vs fine-tuning vs prompting
Which lever to pull for accuracy, and the order to try them in.
- 5 Applied
Orchestration
Wiring tools, retrieval and agents with Dify, n8n and Windmill without lock-in.
- 6 Advanced
Observability & cost control
Token accounting, caching, fallbacks and the guardrails that keep a bill predictable.
Ready to apply it?
The Advisor takes your use case and constraints and returns ranked, costed architectures — managed vs self-host — using exactly these principles.
Open the AdvisorQuestions
Who is this for?
Engineers and technical leaders putting LLMs into production who want the real tradeoffs — hardware, models and serving — not vendor marketing.
Is it free?
Yes. The curriculum, the Advisor and the tools are free. We make money from disclosed managed-provider referrals and paid architecture reviews.
How does this connect to the Advisor?
Learn the concepts here, then the Advisor turns your use case and constraints into a ranked, costed architecture using the same principles.