Incident Review: Intermittent 520 Cloudflare errors with GitLab.com
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out the relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Customers visiting GitLab.com from the west coast of AMER
  - Automated systems sending requests from the west coast of AMER
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - They saw 520/525 errors (Cloudflare could not get a valid response from, or complete the TLS handshake with, the origin) when visiting GitLab.com or sending requests.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
What were the root causes?
Cloudflare provided the RCA over Slack (https://gitlab.slack.com/archives/C049SDU1PU4/p1668728422290499): a new feature that was enabled on their side caused this problem.
Incident Response Analysis
- How was the incident detected?
  - A GitLab team member reported the issue internally in Slack.
- How could detection time be improved?
  - An error ratio alert on Cloudflare status codes (see the sketch below).
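
A minimal sketch of what such a check could look like, assuming Cloudflare edge status codes were already exported to Prometheus; the Prometheus URL, the `cloudflare_edge_responses_total` metric, and the 1% threshold are hypothetical placeholders, not our actual monitoring setup:

```python
# Sketch: ratio of Cloudflare 52x responses to all edge responses over 5 minutes,
# queried from the Prometheus HTTP API. The Prometheus URL and the
# cloudflare_edge_responses_total metric are hypothetical placeholders.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # assumption
QUERY = (
    'sum(rate(cloudflare_edge_responses_total{status=~"52.."}[5m]))'
    ' / sum(rate(cloudflare_edge_responses_total[5m]))'
)


def cloudflare_52x_ratio() -> float:
    """Return the fraction of edge responses that were Cloudflare 52x errors."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    ratio = cloudflare_52x_ratio()
    print(f"Cloudflare 52x ratio over the last 5m: {ratio:.2%}")
    if ratio > 0.01:  # example threshold: page if more than 1% of requests fail
        print("ALERT: elevated Cloudflare 52x error ratio")
```

In practice this would more likely live as a PromQL alerting rule rather than a script; the point is to alert on the 52x ratio instead of waiting for a human report.
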
- How was the root cause diagnosed?
  - Looking in the HAProxy logs we started seeing repeated SSL handshake failures (see the parsing sketch below):

        Nov 6 22:18:00 fe-20-lb-gprd haproxy[26940]: <ip>:17820 [06/Nov/2022:22:18:00.851] https/1: SSL handshake failure
        Nov 6 22:18:02 fe-20-lb-gprd haproxy[26940]: <ip>:43964 [06/Nov/2022:22:18:01.956] https/1: SSL handshake failure
        Nov 6 22:18:03 fe-20-lb-gprd haproxy[26940]: <ip>:44176 [06/Nov/2022:22:18:03.431] https/1: SSL handshake failure

  - We looked at the Cloudflare dashboard and saw that the 520 errors were coming from one specific data centre.
  - We opened a support case with Cloudflare.
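
For reference, a minimal sketch of how the handshake failures quoted above could be grouped per minute; the log path is an assumption and the regex only targets the exact format shown:

```python
# Sketch: count HAProxy "SSL handshake failure" lines per minute to see when
# the spike started. The log path is an assumption; the regex matches the
# log format quoted above.
import re
from collections import Counter

LOG_PATH = "/var/log/haproxy.log"  # assumption; adjust to the real path

# Example line:
# Nov 6 22:18:00 fe-20-lb-gprd haproxy[26940]: <ip>:17820 [06/Nov/2022:22:18:00.851] https/1: SSL handshake failure
PATTERN = re.compile(
    r"\[(?P<minute>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2}\.\d+\].*SSL handshake failure"
)


def failures_per_minute(path: str = LOG_PATH) -> Counter:
    """Group SSL handshake failures by minute."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = PATTERN.search(line)
            if match:
                counts[match.group("minute")] += 1
    return counts


if __name__ == "__main__":
    for minute, count in sorted(failures_per_minute().items()):
        print(f"{minute}  {count} handshake failures")
```
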
- How could time to diagnosis be improved?
  - N/A
- How did we reach the point where we knew how to mitigate the impact?
  - We didn't; we were sitting ducks, waiting for Cloudflare to respond to our support ticket for more than 12 hours.
- How could time to mitigation be improved?
  - More escalation policies for Cloudflare.
- What went well?
  - We quickly identified that it was an upstream provider problem.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No
- What went well?
  - We quickly discovered that this was an upstream issue.
  - We quickly discovered that this was isolated to one data centre.
  - Good handover between EOCs across different timezones.