TL;DR
Modal publishes a comprehensive GPU healthchecking methodology, drawn from managing 20,000+ concurrent GPUs and over four million cloud instance launches, detailing performance gaps between hyperscalers.
Key Points
- Cloud A achieves a 99.6% successful instance launch rate, but its H100s run ~50% slower on a Stable Diffusion benchmark than competitors'
- Cloud C ran H100s above 90°C in 2025, causing ~50% performance degradation; degradation begins in the mid-70s Celsius
- Modal combines passive healthchecks (dmesg, dcgmi health) with active ones (gpu-burn, NCCL tests); weekly deep checks help sustain four-nines uptime
- During Meta's Llama 3 training, GPU issues caused 58.7% of unexpected interruptions vs 0.5% for CPU issues
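
The passive-check idea above can be sketched in a few lines: the NVIDIA driver reports GPU faults to the kernel log as "Xid" events, so a healthcheck can scan dmesg output for them. This is an illustrative sketch, not Modal's actual implementation; the function names and the set of fatal Xid codes are assumptions chosen for the example.

```python
import re

# The NVIDIA driver logs GPU faults to the kernel log as "Xid" events, e.g.:
#   NVRM: Xid (PCI:0000:17:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+)")

# Xid codes commonly treated as hardware trouble (illustrative subset, not
# Modal's policy): 48 = double-bit ECC error, 63/64 = ECC page retirement,
# 79 = GPU has fallen off the bus.
FATAL_XIDS = {48, 63, 64, 79}

def scan_dmesg(dmesg_output: str) -> list[int]:
    """Return all Xid error codes found in a dmesg capture."""
    return [int(m.group("code")) for m in XID_PATTERN.finditer(dmesg_output)]

def gpu_looks_healthy(dmesg_output: str) -> bool:
    """Passive check: healthy iff no fatal Xid events appear in the log."""
    return not any(code in FATAL_XIDS for code in scan_dmesg(dmesg_output))
```

In production such a check would run on a schedule against fresh `dmesg` output (e.g. via `subprocess`), alongside `dcgmi health` queries and periodic active stress tests.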
Why It Matters
GPU reliability remains a critical pain point for ML/AI infrastructure: Meta's Llama 3 data shows GPUs caused roughly 117x more unexpected interruptions than CPUs during training. Modal's open methodology gives engineers renting multi-cloud GPU capacity actionable healthchecking patterns, improving reliability without sacrificing autoscaling performance.
Source: modal.com