Learn production AI engineering

A structured path from chip to workflow. Three lanes, beginner to advanced, grounded in real numbers — then put it to work in the Advisor.

🔩 Silicon

Hardware & serving

GPU selection, VRAM math, quantization tradeoffs, KV-cache tuning and cost-per-token analysis with reproducible numbers.

  1. 1 Core

    Why LLM inference is memory-bound

    Decode speed is set by memory bandwidth, not FLOPs — the single fact that explains GPU choice and cost.

  2. 2 Core

    VRAM math: weights + KV cache

    Bytes/param by precision, plus KV cache that grows with context and batch. Compute it for any model.

    Open →
  3. 3 Core

    Quantization tradeoffs

    fp16 vs bf16 vs fp8 vs AWQ/GPTQ — what you save in memory and what you pay in quality.

  4. 4 Applied

    KV cache, GQA and MLA

    Why context costs memory, and how grouped-query and multi-head latent attention shrink it.

  5. 5 Applied

    Choosing a GPU

    VRAM, bandwidth and $/hr across consumer and datacenter cards — matched to your workload.

    Open →
  6. 6 Advanced

    Throughput, batching & cost/token

    Continuous batching, tokens/sec, and turning hardware into a real cost-per-million-tokens number.

🧠 Models

Evals, architecture & tradeoffs

Open-weight model landscape, architecture comparisons, and closed-API vs self-hosted decision frameworks with real data.

  1. 1 Core

    The open-weight landscape

    Llama, Qwen, DeepSeek, Mistral, Gemma, GLM, Kimi — who makes what, and the licenses that matter.

  2. 2 Core

    Dense vs Mixture-of-Experts

    Total params set memory; active params set compute and quality. Why a 671B MoE can be cheap to run.

  3. 3 Applied

    Reading benchmarks honestly

    LMArena Elo, contamination, and what a quality score does and does not tell you.

    Open →
  4. 4 Applied

    Reasoning models

    Hidden reasoning tokens change real cost and latency — when they earn their keep.

  5. 5 Applied

    Managed API vs self-host

    The decision framework: data boundary, volume, latency and total cost — not just sticker price.

    Open →
  6. 6 Advanced

    Capability fit

    Context length, tool-calling, vision and code ability — matching the model to the job.

🔧 Stack

Serving frameworks & orchestration

vLLM, SGLang, TGI and TensorRT-LLM head-to-heads, plus OSS orchestration with Dify, Windmill and n8n.

  1. 1 Core

    Serving engines compared

    vLLM, SGLang, TGI and TensorRT-LLM — what each is good at and how to benchmark them.

    Open →
  2. 2 Core

    The OpenAI-compatible layer

    The portability trick: one API shape lets you swap managed and self-host with a base-url change.

  3. 3 Applied

    The deployment spectrum

    Managed API → Serverless GPU → IaaS (open weights) → DIY, and where the cost crossover sits.

    Open →
  4. 4 Applied

    RAG vs fine-tuning vs prompting

    Which lever to pull for accuracy, and the order to try them in.

  5. 5 Applied

    Orchestration

    Wiring tools, retrieval and agents with Dify, n8n and Windmill without lock-in.

  6. 6 Advanced

    Observability & cost control

    Token accounting, caching, fallbacks and the guardrails that keep a bill predictable.

Ready to apply it?

The Advisor takes your use case and constraints and returns ranked, costed architectures — managed vs self-host — using exactly these principles.

Open the Advisor

Questions

Who is this for?

Engineers and technical leaders putting LLMs into production who want the real tradeoffs — hardware, models and serving — not vendor marketing.

Is it free?

Yes. The curriculum, the Advisor and the tools are free. We make money from disclosed managed-provider referrals and paid architecture reviews.

How does this connect to the Advisor?

Learn the concepts here, then the Advisor turns your use case and constraints into a ranked, costed architecture using the same principles.