Retro For GitLab.com Cloudflare Change
Incident: production#1843 (closed)
Summary
- Service(s) affected: All GitLab.com services (CI, Git over SSH and HTTPS, web app), but not about.gitlab.com or the handbook/blog
- Team attribution: Infrastructure
- Minutes downtime or degradation: 64 minutes
Metrics
Customer Impact
- Who was impacted by this incident? All users of GitLab.com.
- What was the customer experience during the incident? During the 64 minutes of DNS issues, GitLab.com was not reliably reachable: resolution would sometimes point to the old IP address, sometimes return nothing, and sometimes point to the new address.
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact? Approximate numbers based on request rates, from a query like `sum(increase(haproxy_backend_http_responses_total{}[4h]))` evaluated at 14:00h (a sketch of the comparison query follows the table):
| Requests backend | Prior 2 week average | Difference from average | % difference from average |
|---|---|---|---|
| total | 83889877.5 | 6771640.5 | 8.072059111 |
| git | 11246039.5 | 3132281.5 | 27.85230747 |
| web | 12710058.5 | 2354829.5 | 18.52729081 |
| api | 26134772.5 | 803298.5 | 3.073677033 |
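For reference, a sketch of how these numbers can be derived in PromQL. The `backend` label split and the two-week baseline construction are assumptions; the retro only records the base query.

```promql
# Requests in the 4h window covering the change, per backend
# (assumes the exporter exposes a `backend` label)
sum by (backend) (increase(haproxy_backend_http_responses_total{}[4h]))

# An assumed prior-2-week baseline for the same window: the mean of the same
# expression evaluated one and two weeks earlier (one possible approach)
(
    sum by (backend) (increase(haproxy_backend_http_responses_total{}[4h] offset 1w))
  +
    sum by (backend) (increase(haproxy_backend_http_responses_total{}[4h] offset 2w))
) / 2
```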
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Not applicable: this was part of a planned change. However, we can capture learnings from the switchover itself:
- Re-creating the Spectrum apps by hand in the Cloudflare UI appeared to be what made DNS resolution start working on Cloudflare. We had created them more than 30 days ahead of time (via Terraform; a declaration sketch follows this list). Was leaving them idle for that long a problem?
- Rate limiting became more of an issue. HAProxy configs used `src` for TCP/HTTP request rate limiting, but this changes with the move to Cloudflare: `src` is now, in most cases, one of the Cloudflare edge IPs rather than the original requestor. This is a known behaviour, and Cloudflare passes the original client IP in a header (see the HAProxy sketch after this list). It was still hard for us to test this in staging versus what really happens in production.
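For the Spectrum point above, a minimal sketch of how a Spectrum app for Git over SSH is typically declared with the Cloudflare Terraform provider; the zone, hostname, and origin below are placeholders rather than the actual production definitions.

```hcl
# Sketch of a Cloudflare Spectrum app managed via Terraform.
# All values are placeholders; the real apps and origins are not in this retro.
resource "cloudflare_spectrum_application" "git_ssh" {
  zone_id  = var.cloudflare_zone_id
  protocol = "tcp/22"

  dns {
    type = "CNAME"
    name = "gitlab.example.com"
  }

  origin_direct = ["tcp://203.0.113.10:22"]
}
```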
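As context for the rate-limiting learning, a minimal HAProxy sketch of keying limits on the Cloudflare-provided client IP (`CF-Connecting-IP`) instead of the TCP source; the file path, ACL name, and thresholds are illustrative, not GitLab's production configuration.

```haproxy
frontend https_in
    bind :443

    # Only trust the forwarded client IP when the connection really comes
    # from a Cloudflare edge range (cloudflare_ips.lst is a hypothetical,
    # locally maintained list of Cloudflare CIDRs).
    acl from_cloudflare src -f /etc/haproxy/cloudflare_ips.lst
    http-request set-src req.hdr(CF-Connecting-IP) if from_cloudflare

    # With src rewritten, rate limiting tracks the original requestor
    # again rather than the shared Cloudflare edge IPs.
    stick-table type ip size 1m expire 10m store http_req_rate(10s)
    http-request track-sc0 src
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 100 }
```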
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or a change to infrastructure)? If yes, have you linked the issue which represents the change?
Timeline
We will be adding more detail; this is the short version at the moment:
- 2020-03-28 11:09 UTC: Started the switch of nameservers.
- 2020-03-28 11:11 UTC: The nameserver switch appeared to have mostly completed.
- 2020-03-28 11:15 UTC: Started the switch of DNS records.
- 2020-03-28 11:29 UTC: Reached out to Cloudflare support; the Spectrum apps appeared not to be taking traffic.
- 2020-03-28 11:29 UTC: Example of the failure: `dig +short gitlab.com @diva.ns.cloudflare.com` was failing, which was not expected (resolution checks are sketched after this timeline).
- 2020-03-28 12:05 UTC: Decided to rebuild the Spectrum apps by hand.
- 2020-03-28 12:10 UTC: `dig +short gitlab.com @diva.ns.cloudflare.com` appeared to start working.
- 2020-03-28 12:18 UTC: Traffic on gitlab.com appeared to be returning to normal levels.
- 2020-03-28 12:30 UTC: Fixed the configuration for the assets-static site so avatars and other assets would load correctly.
- 2020-03-30 03:20 UTC: Rate limiting and IP whitelists had been malfunctioning since the start of the migration, this time due to a bug we hit in HAProxy; this was worked around in production#1863 (closed).
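The `dig` checks at 11:29 and 12:10 above were the main way resolution was verified during the cutover. A minimal sketch of that kind of check, querying the Cloudflare nameserver directly so local resolver caches and TTLs do not mask the result; the watch loop is an illustrative helper, not a command from the incident.

```sh
# Ask the Cloudflare nameserver directly, bypassing resolver caches
dig +short gitlab.com @diva.ns.cloudflare.com

# Illustrative helper: watch resolution settle during a cutover
while true; do
  printf '%s ' "$(date -u +%H:%M:%S)"
  dig +short gitlab.com @diva.ns.cloudflare.com | tr '\n' ' '
  echo
  sleep 10
done
```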
5 Whys
Why did we need to rebuild the Spectrum apps?
Why/how did TTL issues and DNS caching cause problems for us?