2023-02-23: WebSocket connections are not being established
Current Status
- 2023-02-20 0519 UTC: Cloudflare cache rules were updated, which caused a majority of WebSocket upgrades to fail on the client side due to timeouts. https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5042
- 2023-02-23 1849 UTC: Active WebSocket connections have dropped by roughly 60% over the past three days, since 2023-02-20. Failed connections receive a 499 Client Closed Request. This impacts real-time features: labels, reviewers, and comments no longer update in real time and need a page refresh instead. Only about 40% of WebSocket connection requests succeed. Requests that time out are retried, which is why we are starting to see an increase in RPS (requests per second).
- 2023-02-23 2149 UTC: Excluded WebSocket paths from the Cloudflare caching rules. WebSocket functionality has returned to normal.
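The exclusion described in the last timeline entry can be pictured as a predicate that runs before any caching decision. The following is only a minimal sketch of that idea in Python, not the actual Cloudflare rule that was deployed. The path suffix `/-/cable`, the function name `should_bypass_cache`, and the host `gitlab.example.com` are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumption: WebSocket upgrades are served under this path suffix.
# The real rule may match different or additional paths.
WEBSOCKET_PATH_SUFFIXES = ("/-/cable",)


def should_bypass_cache(url, headers=None):
    """Return True if a request looks like a WebSocket upgrade and should skip caching."""
    path = urlsplit(url).path.rstrip("/")
    if path.endswith(WEBSOCKET_PATH_SUFFIXES):
        return True

    # Fall back to the handshake header: HTTP header names are case-insensitive.
    for name, value in (headers or {}).items():
        if name.lower() == "upgrade" and value.strip().lower() == "websocket":
            return True
    return False


if __name__ == "__main__":
    print(should_bypass_cache("https://gitlab.example.com/-/cable"))
    print(should_bypass_cache("https://gitlab.example.com/assets/app.css"))
```

Matching on the path keeps the rule simple, while the header check covers upgrade requests that arrive on a path the rule did not anticipate.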
The workaround was to refresh the page.
A ticket has been submitted with Cloudflare: https://support.cloudflare.com/hc/en-us/requests/2715923
The caching rules change affected the caching behavior of the WebSocket endpoint: the cache status went from dynamic to miss, meaning Cloudflare was trying to cache the endpoint. It is still unclear why this happened, since we did not have any rules excluding WebSockets before the change. This is still to be investigated.
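The change from dynamic to miss is visible in the `CF-Cache-Status` response header, so a regression like this one could in principle be detected by comparing headers captured before and after a rules change. The sketch below assumes the headers have already been collected as plain dictionaries. The function names are hypothetical, and treating every status other than `DYNAMIC` as "the cache was involved" is a simplification.

```python
def cache_status(headers):
    """Return the CF-Cache-Status value in upper case, or None if the header is absent."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "cf-cache-status":
            return value.strip().upper()
    return None


def became_cache_eligible(before, after):
    """Return True if an endpoint went from DYNAMIC to some other cache status.

    Simplification: any status other than DYNAMIC (for example MISS or HIT)
    is treated as a sign that Cloudflare considered the response for caching.
    """
    old, new = cache_status(before), cache_status(after)
    return old == "DYNAMIC" and new is not None and new != "DYNAMIC"


if __name__ == "__main__":
    before = {"CF-Cache-Status": "DYNAMIC"}
    after = {"cf-cache-status": "MISS"}
    print(became_cache_eligible(before, after))
```

A check of this kind only flags that the status changed. It does not explain why the endpoint became eligible for caching, which is the open question noted above.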
📝 Summary for CMOC notice / Exec summary:
- Customer Impact: Real-time features are degraded. Labels, reviewers, and comments do not update in real time and need a page refresh instead.
- Service Impact: Service::Websockets
- Impact Duration: 2023-02-19 0000 UTC to 2023-02-23 2149 UTC
- Root cause: RootCause::Config-Change, Service::Cloudflare
📚 References and helpful links
Recent Events (available internally only):
- Feature Flag Log - Chatops to toggle Feature Flags Documentation
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Deployment Guidance
- Deployments Log | GitLab.com Latest Updates
- Reach out to Release Managers for S1/S2 incidents to discuss Rollbacks and/or Hot Patching | Rollback Runbook | Hot Patch Runbook
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
- Corrective action ❙ Infradev
- Incident Review ❙ Infra investigation followup
- Confidential Support contact ❙ QA investigation
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other information, as laid out in our handbook page. Any such confidential data will be in a linked issue that is only visible internally. By default, all information we can share will be public, in accordance with our transparency value.