RCA: 2019-02-01 HTTP 502s
Summary
- Service(s) affected: GitLab.com
- Team attribution: ~"team:infrastructure"
- Minutes downtime or degradation: 3 hours 50 minutes
As part of a maintenance task (production#664, closed), a script was running to slowly move older, larger repositories to a different file server. The jobs being scheduled were failing due to timeouts and then retrying. This piled excessive work onto the file server targeted by the maintenance and gradually ramped up its IO load. The high IO translated into Gitaly timeouts for requests to that server, resulting in many HTTP 502 responses for customers.
Impact & Metrics
- What was the impact of the incident? Slow performance and HTTP 502 responses from GitLab
- Who was impacted by this incident? Everyone
Relevant dashboards:
- https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&from=1549028551843&to=1549037954496
- https://dashboards.gitlab.net/d/000000204/gitaly-metrics-per-host?orgId=1&var-fqdn=file-23-stor-gprd.c.gitlab-production.internal&var-job=gitaly&from=1549028637790&to=1549050237790
- Another fun set of charts: https://dashboards.gitlab.net/d/WOtyonOiz/general-triage-service?orgId=1&var-prometheus_ds=Global&var-environment=gprd&var-type=gitaly&var-sigma=2&var-component_availability=All&var-component_ops=All&var-component_apdex=All&var-component_errors=All&from=1549026945355&to=1549041345355
Detection & Response
- How was the incident detected? Customer reports and our beta monitoring.
- Did alerting work as expected? Not entirely. While we did get alerts for high rates of 500s, they did not come through until the problem had already built up significant momentum, and we did not receive any pages.
- How long did it take from the start of the incident to its detection? Over 1 hour.
- How long did it take from detection to remediation? Over 2 hours.
Timeline
2019-02-01
- 12:53 UTC - maintenance on file server 23 began - command executed: `gitlab-rails runner /tmp/storage_rebalance.rb --current-file-server nfs-file23 --target-file-server nfs-file27 --move-amount 2560 --dry-run false --wait 300 | tee nfs-file23.log`
- 15:02 UTC - first page regarding high 5xx counts - https://gitlab.slack.com/archives/C12RCNXK5/p1549033324163300
- 15:03 UTC - first alert regarding Gitaly error counts on file server 23 - https://gitlab.slack.com/archives/C12RCNXK5/p1549033383163500
- 15:30 UTC - on-call engineer reaches out to the engineer performing the maintenance
- 15:31 UTC - the maintenance operation was halted before proceeding further
- 15:34 UTC - an estimated 31 project moves were in flight; every process related to this maintenance work was `renice`'d to lower the IO impact on the file server
- 15:38 UTC - file server 23 started to show signs of recovery as `git` processes completed
- 15:58 UTC - file server 23 returned to an acceptable SLO for the first time since the degradation began
- 16:41 UTC - file server 23 remained above the SLO
Root Cause Analysis
A seemingly routine maintenance item, which had run for the entire previous week without incident, resulted in a failure. The script driving it queried for projects to be moved and kept scheduling work without taking into account the number of active jobs, nor the failures and retries of existing jobs. These jobs were responsible for telling a file server to move data from one server to another. Some jobs ran long enough to be marked as failed due to timeout, even though the underlying mechanisms were still running on the file server. When such a job was retried, it spun up another process on that file server, so a single project move could end up as three total operations on the same project. When maintenance was stopped, it was discovered that 31 projects were scheduled to move off of this file server, and all of them were still processing data on the repository before completing the move.
This process of moving data invokes the `git upload-pack` command, which is very IO intensive; for large repositories it can take a while. The script created for this maintenance was built only to enqueue the work and let Sidekiq schedule it. Neither the maintenance script nor the operator had any visibility into the progress or failure rate of these repository migrations.
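To make the failure mode concrete, here is a minimal Ruby sketch of the scheduling pattern described above. It is not the actual storage_rebalance.rb or GitLab worker code; the worker class and helper names are hypothetical, and only the fixed `--wait` interval and the timeout/retry behaviour are taken from the incident.

```ruby
require 'sidekiq'

# Hypothetical worker standing in for GitLab's real repository-move job.
class ProjectMoveWorker
  include Sidekiq::Worker

  def perform(project_id, target_storage)
    # The move ultimately drives `git upload-pack` on the source file server.
    # If this call outlives the job's timeout, Sidekiq records a failure and
    # schedules a retry while the original git process keeps running -- so
    # each retry stacks another IO-heavy operation on the same repository.
    move_repository(project_id, target_storage)
  end

  private

  # Placeholder for the real move logic.
  def move_repository(project_id, target_storage); end
end

# Scheduling loop as described above: enqueue a move, sleep a fixed interval
# (the --wait 300 option), and never inspect the state of previously
# scheduled jobs -- active, failed, or retrying.
candidate_project_ids = [] # placeholder for the query selecting older, larger repos
candidate_project_ids.each do |project_id|
  ProjectMoveWorker.perform_async(project_id, 'nfs-file27')
  sleep 300
end
```

Because the scheduler's only back-pressure was the fixed sleep, retries of timed-out jobs were invisible to it, which is how 31 concurrent moves were able to accumulate on file server 23.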
What went well
- The maintenance operator was able to quickly gather a list of processes tied to project moves and `renice` them to help the server along.
- We chose not to kill any running processes in order to prevent data corruption.
- The script was designed to prevent overflowing the queue, which avoided the need for manual intervention on future work that would otherwise have been queued.
Questions
- Is there a potential network bottleneck?
- Can we develop a better understanding of how our Gitaly SLOs impact overall web server performance?
What can be improved
- No one was notified. This situation was not treated as an incident when it should have been, leaving our support team and our users without any status information.
- SREs need a better understanding of the impact of project migrations; this was a new failure scenario that we had not run into previously.
- Documentation for how project moves work doesn't discuss what mechanisms are used to perform them; learning that `git upload-pack` was running multiple times on the file servers was new information.
- Visibility into this process overall is highly limited:
  - This encompasses logging
  - Viewing Sidekiq metrics
- We need to discuss better ways of garnering the attention of the correct parties
- https://gitlab.slack.com/archives/C101F3796/p1549034830974100
- https://gitlab.slack.com/archives/C101F3796/p1549042978985800
- If members think there might be an incident, we need to ensure we mark the correct urgency on potential issues.
Corrective actions
- Improvements should be made to the script that was performing these project repo migrations (one possible throttling approach is sketched after this list) - https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6157
- Better handling of failure scenarios inside of GitLab - https://gitlab.com/gitlab-org/gitlab-ee/issues/9563
- Improve this process's visibility in GitLab - https://gitlab.com/gitlab-org/gitlab-ee/issues/9534
- Improve Gitaly process monitoring/niceness - https://gitlab.com/gitlab-org/gitlab-ee/issues/9606
- Alert Improvements - https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6158
- Communication Improvements - https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6160
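As one possible direction for the script improvement above (a sketch only; the linked issue defines the actual scope), the rebalance script could throttle itself on Sidekiq's queued, busy, and retrying job counts rather than a fixed sleep. The queue and worker names below are assumptions, not the real configuration.

```ruby
require 'sidekiq/api'

# Hypothetical names; the real queue and worker are defined by GitLab, not here.
MOVE_QUEUE    = 'repository_move'
MOVE_CLASS    = 'ProjectMoveWorker'
MAX_IN_FLIGHT = 2

# Count move jobs that are queued, currently running, or awaiting retry.
# (Sidekiq::Workers yields process id, thread id, and a work hash.)
def moves_in_flight
  queued  = Sidekiq::Queue.new(MOVE_QUEUE).size
  busy    = Sidekiq::Workers.new.count { |_pid, _tid, work| work['queue'] == MOVE_QUEUE }
  retries = Sidekiq::RetrySet.new.count { |job| job.klass == MOVE_CLASS }
  queued + busy + retries
end

# Enqueue a move only once in-flight work has drained below the cap,
# instead of sleeping a fixed interval regardless of progress.
def schedule_move(project_id, target_storage)
  sleep 30 while moves_in_flight >= MAX_IN_FLIGHT
  Sidekiq::Client.push('queue' => MOVE_QUEUE,
                       'class' => MOVE_CLASS,
                       'args'  => [project_id, target_storage])
end
```

Gating on busy and retry counts would have surfaced the timed-out-but-still-running moves that the fixed sleep interval hid from the operator.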