Improve alerting for websocket failures
Summary
Websockets in gitlab.com failed to upgrade for a majority of traffic for 3 days during incident production#8457 (closed).
This was due to changes in Cloudflare caching rules, which changed the caching behavior from dynamic to miss, so Cloudflare was effectively trying to cache the endpoint. It is still unclear why, and that is being investigated in a separate issue.
Here we are looking at improving detection, since we were not alerted.
Related Incident(s)
Originating issue(s): production#8457 (closed)
Desired Outcome/Acceptance Criteria
Explore websockets workload metrics and figure out if there are reliable metrics we can use to improve alerting (or existing SLI/O) targeting the number of websocket upgrades vs established connections.
During the incident there was a >60% failure rate on connection upgrades. We want to improve detection by:
- Alerting on a % threshold of failed upgrades
- Alerting on drops/anomalies in established connections
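As a rough illustration of the two checks above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Sample` fields, the 5% failure threshold, and the 50% drop threshold are hypothetical and not taken from existing metrics or alert rules. The real metric names and thresholds would come out of the exploration described in this issue.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """One scrape interval of hypothetical websocket metrics."""
    upgrade_attempts: int   # upgrade requests seen in the interval
    upgrade_failures: int   # upgrade requests that did not complete
    established: int        # connections open at the end of the interval


def failed_upgrade_ratio(sample: Sample) -> float:
    """Fraction of upgrade attempts that failed (0.0 when there was no traffic)."""
    if sample.upgrade_attempts == 0:
        return 0.0
    return sample.upgrade_failures / sample.upgrade_attempts


def should_alert_on_failures(sample: Sample, threshold: float = 0.05) -> bool:
    """Alert when the failed-upgrade ratio exceeds the (assumed) threshold."""
    return failed_upgrade_ratio(sample) > threshold


def should_alert_on_drop(history: list[Sample], current: Sample,
                         max_drop: float = 0.5) -> bool:
    """Alert when established connections fall more than max_drop below
    the mean of the recent history (a deliberately simple baseline)."""
    if not history:
        return False
    baseline = mean(s.established for s in history)
    if baseline == 0:
        return False
    return current.established < baseline * (1 - max_drop)


# Example values are made up; the incident sample mirrors a >60% failure rate.
history = [Sample(1000, 10, 5000), Sample(1000, 12, 5200)]
incident = Sample(1000, 620, 1800)
print(should_alert_on_failures(incident), should_alert_on_drop(history, incident))
```

A mean-based baseline is only one option. Established connections likely follow daily traffic patterns, so a comparison against the same time window on a previous day may give fewer false positives.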
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4')