Home DevOps Article

Mux Video's 22-Day Storage Bug: Cascading Failures in JIT Transcoding

TL;DR

Mux experienced a month-long incident where 0.33% of video segments were corrupted due to context cancellation, replication race conditions, and storage scaling changes.

Key Points

  • 0.33% of audio/video segments corrupted between January 8 - February 4, 2026
  • Root cause: Clean stream closure marked partial writes as valid, bypassing error-checking mechanisms
  • Contributing factors: context cancellation propagating errors across concurrent reads, replication race window widening, storage node scaling bottlenecks
  • Remediation: 3 fixes deployed Jan 29; all corrupted segments regenerated and CDN purged by Feb 3

Why It Matters

This incident demonstrates how distributed systems failures cascade through multiple layers—a scaling optimization created bottlenecks that exposed subtle bugs in concurrent file operations and error handling. For engineers building video platforms or distributed storage systems, this post-mortem reveals critical lessons about context propagation in Go, the dangers of API contract assumptions, and why observability at scale requires rethinking log ingestion and incident correlation.
Read the full post-mortem

Source: www.mux.com