Learn production AI engineering

A structured path from chip to workflow. Three lanes, beginner to advanced, grounded in real numbers — then put it to work in the Advisor.

🔩 Silicon

Hardware & serving

GPU selection, VRAM math, quantization tradeoffs, KV-cache tuning and cost-per-token analysis with reproducible numbers.

1 Core

Why LLM inference is memory-bound

Decode speed is set by memory bandwidth, not FLOPs — the single fact that explains GPU choice and cost.
2 Core

VRAM math: weights + KV cache

Bytes/param by precision, plus KV cache that grows with context and batch. Compute it for any model.
Open →
3 Core

Quantization tradeoffs

fp16 vs bf16 vs fp8 vs AWQ/GPTQ — what you save in memory and what you pay in quality.
4 Applied

KV cache, GQA and MLA

Why context costs memory, and how grouped-query and multi-head latent attention shrink it.
5 Applied

Choosing a GPU

VRAM, bandwidth and $/hr across consumer and datacenter cards — matched to your workload.
Open →
6 Advanced

Throughput, batching & cost/token

Continuous batching, tokens/sec, and turning hardware into a real cost-per-million-tokens number.

🧠 Models

Evals, architecture & tradeoffs

Open-weight model landscape, architecture comparisons, and closed-API vs self-hosted decision frameworks with real data.

1 Core

The open-weight landscape

Llama, Qwen, DeepSeek, Mistral, Gemma, GLM, Kimi — who makes what, and the licenses that matter.
2 Core

Dense vs Mixture-of-Experts

Total params set memory; active params set compute and quality. Why a 671B MoE can be cheap to run.
3 Applied

Reading benchmarks honestly

LMArena Elo, contamination, and what a quality score does and does not tell you.
Open →
4 Applied

Reasoning models

Hidden reasoning tokens change real cost and latency — when they earn their keep.
5 Applied

Managed API vs self-host

The decision framework: data boundary, volume, latency and total cost — not just sticker price.
Open →
6 Advanced

Capability fit

Context length, tool-calling, vision and code ability — matching the model to the job.

🔧 Stack

Serving frameworks & orchestration

vLLM, SGLang, TGI and TensorRT-LLM head-to-heads, plus OSS orchestration with Dify, Windmill and n8n.

1 Core

Serving engines compared

vLLM, SGLang, TGI and TensorRT-LLM — what each is good at and how to benchmark them.
Open →
2 Core

The OpenAI-compatible layer

The portability trick: one API shape lets you swap managed and self-host with a base-url change.
3 Applied

The deployment spectrum

Managed API → Serverless GPU → IaaS (open weights) → DIY, and where the cost crossover sits.
Open →
4 Applied

RAG vs fine-tuning vs prompting

Which lever to pull for accuracy, and the order to try them in.
5 Applied

Orchestration

Wiring tools, retrieval and agents with Dify, n8n and Windmill without lock-in.
6 Advanced

Observability & cost control

Token accounting, caching, fallbacks and the guardrails that keep a bill predictable.

Ready to apply it?

The Advisor takes your use case and constraints and returns ranked, costed architectures — managed vs self-host — using exactly these principles.

Open the Advisor

Questions

Who is this for?

Engineers and technical leaders putting LLMs into production who want the real tradeoffs — hardware, models and serving — not vendor marketing.

Is it free?

Yes. The curriculum, the Advisor and the tools are free. We make money from disclosed managed-provider referrals and paid architecture reviews.

How does this connect to the Advisor?

Learn the concepts here, then the Advisor turns your use case and constraints into a ranked, costed architecture using the same principles.

Learn production AI engineering

🔩 Silicon

Why LLM inference is memory-bound

VRAM math: weights + KV cache

Quantization tradeoffs

KV cache, GQA and MLA

Choosing a GPU

Throughput, batching & cost/token

🧠 Models

The open-weight landscape

Dense vs Mixture-of-Experts

Reading benchmarks honestly

Reasoning models

Managed API vs self-host

Capability fit

🔧 Stack

Serving engines compared

The OpenAI-compatible layer

The deployment spectrum

RAG vs fine-tuning vs prompting

Orchestration

Observability & cost control

Ready to apply it?

Questions

Who is this for?

Is it free?

How does this connect to the Advisor?