HLS live streaming was down
Incident Report for 100ms Systems
Postmortem

Incident Summary

HLS live streaming was unavailable from 11:40 AM IST to 11:52 AM IST on 28 August 2024. Users saw a black screen while streaming live sessions.

Root Cause

We use event-driven autoscaling to add web servers when requests surge. Scaling decisions for web servers, both up and down, are driven by Prometheus metrics. During this incident, Prometheus crashed due to high memory utilization. When the metrics data source fails, the autoscaler falls back to a fixed number of web server instances. That fallback count was too low for the request volume at the time, so the servers were overwhelmed and ultimately crashed.
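To make the failure mode concrete, here is a minimal sketch of how a metrics-driven scaler might choose a replica count and what happens when the metrics source is unreachable. It is an illustration only, not our production configuration: the function names, the per-replica target, and all numbers are hypothetical.

```python
import math
from typing import Optional


def desired_replicas(
    requests_per_second: Optional[float],
    target_rps_per_replica: float,
    min_replicas: int,
    max_replicas: int,
    fallback_replicas: int,
) -> int:
    """Replica count a metrics-driven scaler would request (hypothetical model)."""
    if requests_per_second is None:
        # Metrics source is unreachable, so the scaler cannot see the load
        # and uses the fixed fallback count regardless of real traffic.
        return fallback_replicas
    wanted = math.ceil(requests_per_second / target_rps_per_replica)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


def capacity_shortfall(
    actual_rps: float, target_rps_per_replica: float, replicas: int
) -> float:
    """Requests per second that exceed what the given replicas can serve."""
    return max(0.0, actual_rps - replicas * target_rps_per_replica)


# Illustrative numbers only (assumptions, not incident data).
actual_rps = 5000.0
healthy = desired_replicas(actual_rps, 100.0, 2, 100, fallback_replicas=5)
degraded = desired_replicas(None, 100.0, 2, 100, fallback_replicas=5)
shortfall = capacity_shortfall(actual_rps, 100.0, degraded)
print(healthy, degraded, shortfall)
```

In this sketch the fallback count is independent of traffic, so a value sized for quiet periods leaves a large shortfall if the metrics source fails during a surge.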

Posted Aug 28, 2024 - 10:07 UTC

Resolved
HLS live streaming was down from 11:41 AM IST to 11:52 AM IST in the India region.
Posted Aug 28, 2024 - 06:30 UTC