Incident Summary
HLS live streaming was fully degraded for 12 minutes, from 11:40 AM IST to 11:52 AM IST on 28th August 2024. During this window, users saw a black screen when trying to watch live sessions.
Root Cause
We use an event-driven scaling approach to add web server instances when requests surge and to remove them when traffic drops. Scaling decisions for web servers are driven by Prometheus metrics. During this incident, Prometheus crashed due to high memory utilization, so the scaler lost its data source. When the data source fails, we fall back to a fixed number of web server instances. That fallback number was too low for the volume of requests at the time, so the servers were overwhelmed and ultimately crashed.
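The fallback behaviour described above can be illustrated with a minimal sketch. This is not our actual scaler code; the function name `desired_replicas`, its parameters, and every number in the usage example are hypothetical and chosen only to show how a fixed fallback count can fall far short of what the live traffic needs.

```python
import math
from typing import Optional


def desired_replicas(
    requests_per_second: Optional[float],
    target_rps_per_replica: float,
    fallback_replicas: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Return how many web server instances to run.

    requests_per_second is None when the metrics source (for example,
    Prometheus) cannot be queried. All names here are illustrative.
    """
    if requests_per_second is None:
        # Metrics source is down: the scaler cannot see the load, so it
        # uses the fixed fallback count regardless of real traffic.
        return fallback_replicas

    # Normal path: size the pool so each replica stays at or below its target.
    needed = math.ceil(requests_per_second / target_rps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


# Hypothetical numbers: 5000 req/s, 100 req/s per replica as the target.
healthy = desired_replicas(5000.0, 100.0, fallback_replicas=5,
                           min_replicas=2, max_replicas=100)
degraded = desired_replicas(None, 100.0, fallback_replicas=5,
                            min_replicas=2, max_replicas=100)
print(healthy, degraded)
```

With these assumed values, the healthy path asks for 50 instances while the fallback path returns 5, so each remaining instance would face roughly ten times its target load. A fallback count sized from recent peak traffic rather than a small constant would narrow that gap.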