You don't need an H100: matching GPU workload to hardware

A real diffusion-TTS pipeline case study. Why memory bandwidth — not parameter count — decides your GPU, and how to burst to cloud GPUs for $0.40 a render.

The most powerful technical content isn’t a tutorial — it’s a documented real-world decision made under real constraints. Here’s one: a solo creator building an automated YouTube pipeline on an M1 Pro 16 GB, cost-conscious, hitting swap and watching renders take days.

The fix has nothing to do with LLM serving, yet the decision framework is identical to what an AI infrastructure engineer uses every day.

The core insight

Diffusion-based TTS and LLM serving have fundamentally different hardware profiles.

| Workload | Bottleneck | What you need | Wrong choice |
|---|---|---|---|
| LLM serving (70B) | Memory capacity | HBM3 (H100 80 GB x2) | Consumer GPU: not enough VRAM |
| Diffusion TTS (0.5B) | Memory bandwidth | GDDR7 (RTX 5090, 1.79 TB/s) | H100: overkill, worse $/iteration |
| Video encoding | CPU cores + NVENC | Many vCPUs + NVENC | GPU-only, CPU-starved |

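A 0.5B diffusion model makes many iterative passes over a small set of weights, so each pass is limited by how fast those weights can be read, not by how much memory is available. That's a bandwidth problem, not a capacity problem. An H100 leaves ~80% of its HBM unused on a model this size: wrong tool, higher cost per iteration.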
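
As a rough sanity check on the bandwidth argument, the sketch below estimates a lower bound on the time for one pass over the weights: weight bytes divided by memory bandwidth. The 2 bytes per parameter (fp16) is an assumption, not a figure from the pipeline, and `min_pass_ms` is a hypothetical helper. Real passes also move activations and pay for compute, so treat this as a floor, not a prediction.

```shell
# min_pass_ms PARAMS_BILLIONS BYTES_PER_PARAM BANDWIDTH_TBPS
# Lower bound (in ms) on reading every weight once from GPU memory.
min_pass_ms() {
  awk -v p="$1" -v b="$2" -v bw="$3" 'BEGIN {
    bytes = p * 1e9 * b          # total weight bytes
    secs  = bytes / (bw * 1e12)  # time to stream them once
    printf "%.2f\n", secs * 1000
  }'
}

# 0.5B params, assumed fp16 (2 bytes each), RTX 5090 at 1.79 TB/s
min_pass_ms 0.5 2 1.79   # prints 0.56
```

Under these assumptions the weights total about 1 GB, so a single pass over them is bounded at well under a millisecond, and capacity is nowhere near the constraint.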

Cost per render

| GPU | $/hr | Notes |
|---|---|---|
| H100 SXM | 2.69 | HBM3 wasted on 0.5B. Wrong tool. |
| RTX A5000 | 0.16 | Cheap, ~3x slower per iteration. |
| RTX 4090 | 0.34 | Good value, widely available. |
| RTX 5090 | 0.69 | GDDR7, best bandwidth/dollar for diffusion. |

On an RTX 5090, a full 300-question render takes about 20 minutes and costs about $0.40, versus multiple days on the M1 Pro.
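
The arithmetic behind a per-render figure is hourly rate times billed time. A minimal sketch, where `render_cost` is a hypothetical helper and the 20 minutes comes from the estimate above. It counts GPU render time only; anything else that is billed (pod startup, file sync, volume storage) would come on top, which is an assumption about where the rest of the quoted ~$0.40 goes.

```shell
# render_cost RATE_PER_HOUR MINUTES
# Cost in dollars of MINUTES of billed GPU time at RATE_PER_HOUR.
render_cost() {
  awk -v rate="$1" -v mins="$2" 'BEGIN { printf "%.2f\n", rate * mins / 60 }'
}

render_cost 0.69 20   # RTX 5090, ~20 min of rendering; prints 0.23
```

Swapping in another row's hourly rate and a measured render time for that card gives a like-for-like comparison.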

Zero-click cloud burst from a Mac

The pattern that makes this practical: spin up a GPU pod, sync code, render, pull the result, then immediately swap to a cheap CPU pod for the datacenter-speed upload.

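```shell
# Create a GPU pod with the persistent volume mounted and capture its ID.
POD_ID=$(runpodctl pod create \
  --name "render-$(date +%Y%m%d-%H%M)" \
  --gpu-type "NVIDIA GeForce RTX 5090" \
  --volume-id "$RUNPOD_VOLUME_ID" \
  --volume-mount "/workspace" | jq -r '.id')

# Render on the pod, then copy the finished video back to the Mac.
# Assumes "$POD_ID" resolves as an SSH host (for example via an ssh config entry).
ssh "root@$POD_ID" "cd /workspace/pipeline && python3 main.py --workers 12 && npm run render"
mkdir -p ./renders   # make sure the local target directory exists
scp "root@$POD_ID:/workspace/output/final_video.mp4" ./renders/
runpodctl pod delete "$POD_ID"   # stop GPU billing immediately
```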

A persistent network volume caches model weights, so cold start drops from ~15 min to ~90 seconds. The GPU only bills while it’s actually rendering.
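
Because billing only stops when the pod is deleted, it can help to make the delete step run even when the render step fails. The sketch below is one way to do that with a shell `trap`. `run_with_cleanup` and the `DELETE_CMD` variable are hypothetical names introduced here, and the default delete command simply mirrors the `runpodctl pod delete` call used above.

```shell
# Command that deletes a pod; overridable so the pattern can be dry-run.
DELETE_CMD="${DELETE_CMD:-runpodctl pod delete}"

# run_with_cleanup POD_ID CMD [ARGS...]
# Runs CMD in a subshell and always deletes the pod when that subshell exits,
# whether CMD succeeded or failed.
run_with_cleanup() {
  local pod_id="$1"
  shift
  (
    # DELETE_CMD is left unquoted on purpose so it splits into command + args.
    trap '$DELETE_CMD "$pod_id"' EXIT
    "$@"
  )
}

# Dry run: swap the real delete for an echo to watch the cleanup fire
# even though the wrapped command (false) fails.
DELETE_CMD="echo would delete" run_with_cleanup demo-pod false || true
```

In the burst script, the ssh and scp steps would be the wrapped command, so a failed render no longer leaves a GPU pod running.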

Ephemeral burst beats an always-on GPU box for spiky workloads. Match the machine to the bottleneck, then turn it off.

#gpu #diffusion #runpod #cost #case-study