TL;DR
Shopify executed nine months of chaos engineering and load testing, simulating 150% of last year's peak traffic to harden infrastructure before Black Friday.
Key Points
- Five major scale tests from April–October, ramping from 146M to 200M requests per minute
- 2024 baseline: 57.3 PB data, 10.5 trillion database queries, 284M edge RPM, 80M app server RPM
- Game Days exposed and fixed Kafka partitions, memory optimization, connection timeouts, and regional failover issues
- New analytics API layer never experienced holiday traffic—stress-tested via controlled chaos to validate unknown bottlenecks
Why It Matters
This case study reveals production-grade chaos engineering and capacity planning at scale. Engineers can learn concrete techniques for stress-testing distributed systems, identifying cascading failure modes, and coordinating multi-region failover—critical skills for running reliable infrastructure under extreme load.
Source: shopify.engineering