2020-09-16 TLS connectivity issues

Summary

Between approximately 17:00 and 18:00 UTC, we received various reports of TLS connection errors from an apparently random subset of requests to GitLab.com. We are looking into the logs of those errors and talking to our providers about failures in that time window.

2020-09-16

prometheus-01-inf-testbed chef-client errors

Timeline

All times UTC.

2020-09-16

  • 18:32 - Firing 1 - Chef client failures have reached critical levels
  • 18:46 - alejandro declares an incident in Slack using the /incident declare command.
  • 19:22 - Alert cleared
  • 21:14 - Interactions with gitlab.com and ops.gitlab.net appear to be normal. We have evidence of errors in the past few hours that we are going to continue to investigate.

Incident Review

Summary

On 2020-09-16 we received reports of slowdowns and varied TLS issues, both from users and from errors we experienced in our own infrastructure and projects.

  1. Service(s) affected: ServiceWeb ServiceAPI ServiceInfrastructure
  2. Team attribution: teamReliability
  3. Minutes downtime or degradation: According to Cloudflare's RCA, impact started at 17:08 and ended at 20:47 UTC, which adds up to 3 hours, 39 minutes.

Metrics

Our monitoring doesn't show any noticeable variation for this incident overall, given that it occurred in an external component. On hosts performing outbound requests we were able to see an increase in TCP retries (a query sketch follows the dashboard link below):

Screenshot: Screen_Shot_2020-09-16_at_1.26.19_PM (increase in TCP retransmits on an affected host)

https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats-old-prometheus?viewPanel=44&orgId=1&from=now-7d&to=now&var-environment=gstg&var-node=prometheus-02-inf-gstg.c.gitlab-staging-1.internal&var-promethus=prometheus-01-inf-gstg
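For reference, the sketch below shows how the same retransmit data could be pulled from the Prometheus HTTP API. It is illustrative only: the Prometheus URL and instance label are placeholder assumptions, and it assumes node_exporter's default node_netstat_Tcp_RetransSegs metric.

```python
# Minimal sketch: query the Prometheus HTTP API for the TCP retransmit rate
# on a host during the incident window. PROM_URL and the instance label are
# placeholders, not real configuration.
import json
import urllib.parse
import urllib.request

PROM_URL = "https://prometheus.example.internal"  # placeholder endpoint


def tcp_retransmit_rate(instance, start, end, step="60s"):
    """Return (timestamp, value) pairs of the 5m TCP retransmit rate for a host."""
    query = f'rate(node_netstat_Tcp_RetransSegs{{instance="{instance}"}}[5m])'
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    with urllib.request.urlopen(f"{PROM_URL}/api/v1/query_range?{params}") as resp:
        result = json.load(resp)["data"]["result"]
    return result[0]["values"] if result else []


if __name__ == "__main__":
    # Incident window from Cloudflare's RCA: 2020-09-16 17:08-20:47 UTC.
    for ts, value in tcp_retransmit_rate(
        "prometheus-02-inf-gstg.c.gitlab-staging-1.internal:9100",  # example instance
        start="2020-09-16T17:00:00Z",
        end="2020-09-16T21:00:00Z",
    ):
        print(ts, value)
```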

Customer Impact

  1. Who was impacted by this incident? Both the GitLab team, due to failing pipelines in our projects, including the main GitLab repository, and other users of GitLab.com
  2. What was the customer experience during the incident? Slowdown in git-over-https operations, sometimes resulting in timeouts or disconnects
  3. How many customers were affected? According to Cloudflare, "Stream uploads and downloads from the origin [GCP] were slow and intermittently timing out. Some other origin traffic (but not all routes) saw high latency". The symptom that alerted us to this issue was a chef-client failure on one of our hosts, but no other host had chef-client errors during the incident. Analogously, we saw an increase in pipeline failures for gitlab-org/gitlab but not for other projects.
  4. If a precise customer impact number is unknown, what is the estimated potential impact?

Incident Response Analysis

  1. How was the event detected? User reports (https://gitlab.slack.com/archives/C101F3796/p1600278731474100) and alert for chef-client failures
  2. How could detection time be improved? The fact that one of our chef-client runs failed and caused the alert seems to have been fortuitous, and it's possible this incident could've gone by without any of our alerts triggering. One idea would be to add more geographically-distributed probes for git-over-http and git-over-ssh operations.
  3. How did we reach the point where we knew how to mitigate the impact? In this case, our internal investigation was not fruitful, and we only confirmed a root cause once Cloudflare confirmed the incident on their side.
  4. How could time to mitigation be improved? Mitigation of this incident was outside of our power.

Post Incident Analysis

  1. How was the root cause diagnosed? We opened a ticket with Cloudflare and with Google (via Rackspace), providing the data for the issues we were observing. Google responded stating they found no issues on their side. Cloudflare initially responded the same way, but later confirmed an issue on their infrastructure.
  2. How could time to diagnosis be improved? In this case, the factor under our control was how quickly we reached out to the external providers. The time to diagnosis after that is outside our power.
  3. Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? No
  4. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change? No

5 Whys

  1. Customers experienced latency increases and disconnection issues over TCP, why?
    • From Cloudflare's RCA: "Customers who used a certain origin provider saw high latency after the origin removed all advertisements in North America except in Sao Paolo. Cloudflare advertised the origin prefixes over the backbone, which caused all origin pulls for this origin to route through Sao Paolo for all backbone-connected datacenters."
  2. Why did we only see a single alert trigger?
    • All our components were healthy and working within their SLOs. The latency was occurring outside our infrastructure (either before requests reached our layers or after responses had been sent).
  3. Why did we not see more widespread user reports or pipeline failures?
    • Connectivity issues were not consistent, so some (most?) operations were not affected. In our tests during the incident, operations on small repositories were usually much more reliable than operations on large repositories, where latencies and timeouts had a much greater chance of affecting data transfers of over ~100 MB. This might also explain why most repository operations didn't seem affected.

Lessons Learned

  1. We don't have a way of monitoring pipeline failures that can point us to an infrastructure issue (as opposed to a misconfiguration or a code-change-related issue). According to our current workflow, "The Engineering Productivity team is the triage DRI for monitoring master pipeline failures".
  2. Our probe alerting is limited and geographically concentrated: we rely mostly on blackbox probes that also live on GCP (a minimal handshake-timing sketch follows this list).
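To illustrate the kind of check a more geographically distributed blackbox probe could perform, here is a minimal TLS handshake timing sketch. It is an assumption-laden illustration, not existing probe configuration; the target host and timeout are placeholders.

```python
# Sketch of a TLS handshake latency check that could run from regions outside GCP.
# The target host and timeout are illustrative assumptions.
import socket
import ssl
import time


def tls_handshake_seconds(host, port=443, timeout=10.0):
    """Measure TCP connect plus TLS handshake time to host:port, in seconds."""
    context = ssl.create_default_context()
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host):
            pass  # the handshake completes during wrap_socket; we only need the timing
    return time.monotonic() - start


if __name__ == "__main__":
    elapsed = tls_handshake_seconds("gitlab.com")
    print(f"TLS handshake to gitlab.com completed in {elapsed:.3f}s")
```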

Corrective Actions

  1. Assuming that most pipeline failures are caused by configuration or code changes, our current workflow of having the Engineering Productivity team perform the triaging is appropriate. We could slightly improve the handbook page https://about.gitlab.com/handbook/engineering/workflow/#broken-master: the Slack channel #master-broken specifies :infrastructure: as the emoji to use to flag a pipeline failure as (possibly) originating from an infrastructure issue, but this is not mentioned in the handbook. We could also link to the relevant handbook page for production incident escalation.
  2. Add more geographically distributed probes for git-over-http and git-over-ssh operations, perhaps outside of GCP (although it's by no means certain that such additional probes would've caught this issue). A sketch of such a probe follows this list.
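As a companion to the handshake check above, a hypothetical git-over-https probe could time a full refs advertisement, which exercises DNS, TCP, TLS and the smart-HTTP endpoint end to end. The repository URL, timeout and latency threshold below are illustrative assumptions, not an existing probe definition.

```python
# Hypothetical git-over-https probe: time a `git ls-remote` against a repository.
# Assumes `git` is installed on the probe host; the URL and thresholds are examples.
import subprocess
import time

REPO_URL = "https://gitlab.com/gitlab-org/gitlab.git"  # example target
TIMEOUT_SECONDS = 30
LATENCY_SLO_SECONDS = 5.0


def probe_git_https(repo_url):
    """Run `git ls-remote --heads` and return (success, elapsed_seconds)."""
    start = time.monotonic()
    try:
        subprocess.run(
            ["git", "ls-remote", "--heads", repo_url],
            check=True,
            capture_output=True,
            timeout=TIMEOUT_SECONDS,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False, time.monotonic() - start
    return True, time.monotonic() - start


if __name__ == "__main__":
    ok, elapsed = probe_git_https(REPO_URL)
    status = "ok" if ok else "failed"
    print(f"git-over-https probe {status} in {elapsed:.2f}s (SLO {LATENCY_SLO_SECONDS}s)")
```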
