Incident Review: Intermittent 520 Cloudflare errors with GitLab.com
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out the relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Customers visiting GitLab.com from the west coast of AMER
  - Automated systems sending requests from the west coast of AMER
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - They saw 520/525 errors (Cloudflare could not get a valid response from, or complete the TLS handshake with, the origin) when visiting GitLab.com or sending requests.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
What were the root causes?
Cloudflare provided the RCA over Slack (https://gitlab.slack.com/archives/C049SDU1PU4/p1668728422290499): a new feature that was enabled on their side caused this problem.
Incident Response Analysis
- How was the incident detected?
  - A GitLab team member reported the issue internally in Slack.
- How could detection time be improved?
  - An error ratio alert on Cloudflare status codes (see the sketch below).
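
A minimal sketch of what such a check could look like, assuming Cloudflare edge status codes were already exported to Prometheus; the Prometheus URL, the `cloudflare_edge_responses_total` metric, and the 1% threshold are hypothetical placeholders, not our actual monitoring setup:

```python
# Sketch: ratio of Cloudflare 52x responses to all edge responses over 5 minutes,
# queried from the Prometheus HTTP API. The Prometheus URL and the
# cloudflare_edge_responses_total metric are hypothetical placeholders.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # assumption
QUERY = (
    'sum(rate(cloudflare_edge_responses_total{status=~"52.."}[5m]))'
    ' / sum(rate(cloudflare_edge_responses_total[5m]))'
)


def cloudflare_52x_ratio() -> float:
    """Return the fraction of edge responses that were Cloudflare 52x errors."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    ratio = cloudflare_52x_ratio()
    print(f"Cloudflare 52x ratio over the last 5m: {ratio:.2%}")
    if ratio > 0.01:  # example threshold: page if more than 1% of requests fail
        print("ALERT: elevated Cloudflare 52x error ratio")
```

In practice this would more likely live as a PromQL alerting rule rather than a script; the point is to alert on the 52x ratio instead of waiting for a human report.
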
- How was the root cause diagnosed?
  - Looking in the HAProxy logs we started seeing repeated SSL handshake failures (see the parsing sketch below):

        Nov 6 22:18:00 fe-20-lb-gprd haproxy[26940]: <ip>:17820 [06/Nov/2022:22:18:00.851] https/1: SSL handshake failure
        Nov 6 22:18:02 fe-20-lb-gprd haproxy[26940]: <ip>:43964 [06/Nov/2022:22:18:01.956] https/1: SSL handshake failure
        Nov 6 22:18:03 fe-20-lb-gprd haproxy[26940]: <ip>:44176 [06/Nov/2022:22:18:03.431] https/1: SSL handshake failure

  - We looked at the Cloudflare dashboard and saw that the 520 errors were coming from one specific data centre.
  - We opened a support case with Cloudflare.
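
For reference, a minimal sketch of how the handshake failures quoted above could be grouped per minute; the log path is an assumption and the regex only targets the exact format shown:

```python
# Sketch: count HAProxy "SSL handshake failure" lines per minute to see when
# the spike started. The log path is an assumption; the regex matches the
# log format quoted above.
import re
from collections import Counter

LOG_PATH = "/var/log/haproxy.log"  # assumption; adjust to the real path

# Example line:
# Nov 6 22:18:00 fe-20-lb-gprd haproxy[26940]: <ip>:17820 [06/Nov/2022:22:18:00.851] https/1: SSL handshake failure
PATTERN = re.compile(
    r"\[(?P<minute>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2}\.\d+\].*SSL handshake failure"
)


def failures_per_minute(path: str = LOG_PATH) -> Counter:
    """Group SSL handshake failures by minute."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = PATTERN.search(line)
            if match:
                counts[match.group("minute")] += 1
    return counts


if __name__ == "__main__":
    for minute, count in sorted(failures_per_minute().items()):
        print(f"{minute}  {count} handshake failures")
```
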
- How could time to diagnosis be improved?
  - N/A
- How did we reach the point where we knew how to mitigate the impact?
  - We didn't; we were sitting ducks, waiting for Cloudflare to respond to our support ticket for more than 12 hours.
- How could time to mitigation be improved?
  - More escalation policies for Cloudflare.
- What went well?
  - We quickly identified that it was an upstream provider problem.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No
- What went well?
  - We quickly discovered that this was an upstream issue.
  - We quickly discovered that this was isolated to one data centre.
  - Good handover between EOCs across different timezones.