Improve alerting for websocket failures

Summary

Websocket upgrades on gitlab.com failed for a majority of traffic over 3 days during incident production#8457 (closed).

This was caused by changes to Cloudflare caching rules, which changed the cache behavior for the endpoint from dynamic to miss, meaning Cloudflare was effectively trying to cache it. It is still unclear why this happened; that is being investigated in a separate issue.

This issue focuses on improving detection, since we were not alerted to the failure.

Related Incident(s)

Originating issue(s): production#8457 (closed)

Desired Outcome/Acceptance Criteria

Explore websocket workload metrics and determine whether there are reliable metrics we can use to improve alerting (or the existing SLIs/SLOs), targeting the number of websocket upgrades versus established connections.

During the incident there was a >60% failure rate on connection upgrades. We want to improve detection by:

  • Alerting on a % threshold of failed upgrades
  • Alerting on drops/anomalies in established connections
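
As a starting point for discussion, the two checks above could look roughly like the sketch below. It is only an illustration: the function names, the 10% failure threshold, and the 50% drop threshold are hypothetical placeholders, and the input counts would have to come from whichever websocket metrics the exploration identifies as reliable.

```python
from statistics import mean
from typing import Sequence


def upgrade_failure_ratio(attempted: int, established: int) -> float:
    """Fraction of upgrade attempts that did not become established connections."""
    if attempted <= 0:
        return 0.0  # no traffic in the window, so nothing to evaluate
    failed = max(attempted - established, 0)
    return failed / attempted


def should_alert_on_failures(attempted: int, established: int,
                             threshold: float = 0.10) -> bool:
    """True when the failure ratio exceeds the threshold (hypothetical default)."""
    return upgrade_failure_ratio(attempted, established) > threshold


def connections_dropped(history: Sequence[float], current: float,
                        max_drop: float = 0.5) -> bool:
    """True when `current` is more than `max_drop` below the mean of `history`.

    `history` holds recent samples of established connections; the 50%
    default is an assumed value, not a tuned one.
    """
    if not history:
        return False  # no baseline to compare against
    baseline = mean(history)
    if baseline <= 0:
        return False
    return current < baseline * (1 - max_drop)
```

With numbers similar to the incident (for example 1000 attempted upgrades and 400 established connections), the first check would fire, since a 60% failure ratio is well above the assumed threshold. In practice these comparisons would more likely be expressed as alerting rules over the chosen metrics than as application code.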

Associated Services

Service::Websockets

Corrective Action Issue Checklist

  • Link the incident(s) this corrective action arose out of
  • Give context for the problem this corrective action is trying to prevent from recurring
  • Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • Assign a priority (this will default to 'Reliability::P4')