2020-09-16 TLS connectivity issues

Summary

Between approximately 17:00 and 18:00 UTC, we received various reports of TLS connection errors from an apparently random subset of requests to GitLab.com. We are looking into the logs of those errors and talking to our providers about failures in that time window.

2020-09-16

prometheus-01-inf-testbed chef-client errors

Timeline

All times UTC.

2020-09-16

  • 18:32 - Firing 1 - Chef client failures have reached critical levels
  • 18:46 - alejandro declares an incident in Slack using the /incident declare command.
  • 19:22 - Alert cleared
  • 21:14 - Interactions with gitlab.com and ops.gitlab.net appear to be normal. We have evidence of errors in the past few hours that we are going to continue to investigate.

Incident Review

Summary

On 2020-09-16 we received reports of slowdowns and varied TLS issues, both from users and from errors we experienced in our own infrastructure and projects.

  1. Service(s) affected: ServiceWeb ServiceAPI ServiceInfrastructure
  2. Team attribution: teamReliability
  3. Minutes downtime or degradation: According to Cloudflare's RCA, impact started at 17:08 and ended at 20:47 UTC, which adds up to 3 hours, 39 minutes.

Metrics

Our monitoring doesn't show any noticeable variation for this incident overall, given that it occurred in an external component. On hosts performing outbound requests we were able to see an increase in TCP retries (a query sketch follows the dashboard link below):

Screenshot: Screen_Shot_2020-09-16_at_1.26.19_PM (increase in TCP retransmits on an affected host)

https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats-old-prometheus?viewPanel=44&orgId=1&from=now-7d&to=now&var-environment=gstg&var-node=prometheus-02-inf-gstg.c.gitlab-staging-1.internal&var-promethus=prometheus-01-inf-gstg
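For reference, the sketch below shows how the same retransmit data could be pulled from the Prometheus HTTP API. It is illustrative only: the Prometheus URL and instance label are placeholder assumptions, and it assumes node_exporter's default node_netstat_Tcp_RetransSegs metric.

```python
# Minimal sketch: query the Prometheus HTTP API for the TCP retransmit rate
# on a host during the incident window. PROM_URL and the instance label are
# placeholders, not real configuration.
import json
import urllib.parse
import urllib.request

PROM_URL = "https://prometheus.example.internal"  # placeholder endpoint


def tcp_retransmit_rate(instance, start, end, step="60s"):
    """Return (timestamp, value) pairs of the 5m TCP retransmit rate for a host."""
    query = f'rate(node_netstat_Tcp_RetransSegs{{instance="{instance}"}}[5m])'
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    with urllib.request.urlopen(f"{PROM_URL}/api/v1/query_range?{params}") as resp:
        result = json.load(resp)["data"]["result"]
    return result[0]["values"] if result else []


if __name__ == "__main__":
    # Incident window from Cloudflare's RCA: 2020-09-16 17:08-20:47 UTC.
    for ts, value in tcp_retransmit_rate(
        "prometheus-02-inf-gstg.c.gitlab-staging-1.internal:9100",  # example instance
        start="2020-09-16T17:00:00Z",
        end="2020-09-16T21:00:00Z",
    ):
        print(ts, value)
```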

Customer Impact

  1. Who was impacted by this incident? Both the GitLab team, due to failing pipelines in our projects, including the main GitLab repository, and other users of GitLab.com
  2. What was the customer experience during the incident? Slowdown in git-over-https operations, sometimes resulting in timeouts or disconnects
  3. How many customers were affected? According to Cloudflare, "Stream uploads and downloads from the origin [GCP] were slow and intermittently timing out. Some other origin traffic (but not all routes) saw high latency". The symptom that alerted us to this issue was a chef-client failure on one of our hosts, but no other host had chef-client errors during the incident. Analogously, we saw an increase in pipeline failures for gitlab-org/gitlab but not for other projects.
  4. If a precise customer impact number is unknown, what is the estimated potential impact?

Incident Response Analysis

  1. How was the event detected? User reports (https://gitlab.slack.com/archives/C101F3796/p1600278731474100) and alert for chef-client failures
  2. How could detection time be improved? The fact that one of our chef-client runs failed and caused the alert seems to have been fortuitous, and it's possible this incident could've gone by without any of our alerts triggering. One idea would be to add more geographically-distributed probes for git-over-http and git-over-ssh operations.
  3. How did we reach the point where we knew how to mitigate the impact? In this case, our internal investigation was not fruitful, and we only confirmed a root cause once Cloudflare confirmed the incident on their side.
  4. How could time to mitigation be improved? Mitigation of this incident was outside of our power.

Post Incident Analysis

  1. How was the root cause diagnosed? We opened a ticket with Cloudflare and with Google (via Rackspace), providing the data for the issues we were observing. Google responded stating they found no issues on their side. Cloudflare initially responded the same way, but later confirmed an issue on their infrastructure.
  2. How could time to diagnosis be improved? In this case, the factor under our control was how quickly we reached out to the external providers. The time to diagnosis after that is outside our power.
  3. Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? No
  4. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change? No

5 Whys

  1. Customers experienced latency increases and disconnection issues over TCP, why?
    • From Cloudflare's RCA: "Customers who used a certain origin provider saw high latency after the origin removed all advertisements in North America except in Sao Paolo. Cloudflare advertised the origin prefixes over the backbone, which caused all origin pulls for this origin to route through Sao Paolo for all backbone-connected datacenters."
  2. Why did we only see a single alert trigger?
    • All our components were healthy and working within their SLOs. The latency was occurring outside our infrastructure (either before requests reached our layers or after responses had been sent).
  3. Why did we not see more widespread user reports or pipeline failures?
    • Connectivity issues were not consistent, so some (most?) operations were not affected. In our tests during the incident, operations on small repositories were usually much more reliable than operations on large repositories, where latencies and timeouts had a much greater chance of affecting data transfers of over ~100 MB. This might also explain why most repository operations didn't seem affected.

Lessons Learned

  1. We don't have a way of monitoring pipeline failures that can point us to an infrastructure issue (as opposed to a misconfiguration or a code-change-related issue). According to our current workflow, "The Engineering Productivity team is the triage DRI for monitoring master pipeline failures".
  2. Our probe alerting is limited and geographically concentrated: we rely mostly on blackbox probes that also live on GCP (a minimal handshake-timing sketch follows this list).
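To illustrate the kind of check a more geographically distributed blackbox probe could perform, here is a minimal TLS handshake timing sketch. It is an assumption-laden illustration, not existing probe configuration; the target host and timeout are placeholders.

```python
# Sketch of a TLS handshake latency check that could run from regions outside GCP.
# The target host and timeout are illustrative assumptions.
import socket
import ssl
import time


def tls_handshake_seconds(host, port=443, timeout=10.0):
    """Measure TCP connect plus TLS handshake time to host:port, in seconds."""
    context = ssl.create_default_context()
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host):
            pass  # the handshake completes during wrap_socket; we only need the timing
    return time.monotonic() - start


if __name__ == "__main__":
    elapsed = tls_handshake_seconds("gitlab.com")
    print(f"TLS handshake to gitlab.com completed in {elapsed:.3f}s")
```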

Corrective Actions

  1. Assuming that most pipeline failures are caused by configuration or code changes, our current workflow of having the Engineering Productivity team perform the triaging is appropriate. We could slightly improve the handbook page https://about.gitlab.com/handbook/engineering/workflow/#broken-master: the Slack channel #master-broken specifies :infrastructure: as the emoji to use to flag a pipeline failure as (possibly) originating from an infrastructure issue, but this is not mentioned in the handbook. We could also link to the relevant handbook page for production incident escalation.
  2. Add more geographically distributed probes for git-over-http and git-over-ssh operations, perhaps outside of GCP (although it's by no means certain that such additional probes would've caught this issue). A sketch of such a probe follows this list.
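As a companion to the handshake check above, a hypothetical git-over-https probe could time a full refs advertisement, which exercises DNS, TCP, TLS and the smart-HTTP endpoint end to end. The repository URL, timeout and latency threshold below are illustrative assumptions, not an existing probe definition.

```python
# Hypothetical git-over-https probe: time a `git ls-remote` against a repository.
# Assumes `git` is installed on the probe host; the URL and thresholds are examples.
import subprocess
import time

REPO_URL = "https://gitlab.com/gitlab-org/gitlab.git"  # example target
TIMEOUT_SECONDS = 30
LATENCY_SLO_SECONDS = 5.0


def probe_git_https(repo_url):
    """Run `git ls-remote --heads` and return (success, elapsed_seconds)."""
    start = time.monotonic()
    try:
        subprocess.run(
            ["git", "ls-remote", "--heads", repo_url],
            check=True,
            capture_output=True,
            timeout=TIMEOUT_SECONDS,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False, time.monotonic() - start
    return True, time.monotonic() - start


if __name__ == "__main__":
    ok, elapsed = probe_git_https(REPO_URL)
    status = "ok" if ok else "failed"
    print(f"git-over-https probe {status} in {elapsed:.2f}s (SLO {LATENCY_SLO_SECONDS}s)")
```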
