RCA: 2019-02-01 HTTP 502s
Summary
- Service(s) affected: GitLab.com
- Team attribution: ~"team:infrastructure"
- Minutes downtime or degradation: 3 hours 50 minutes
As part of a maintenance task (production#664, closed), a script was running to slowly move older, larger repositories to a different file server. The jobs being scheduled were failing due to timeouts and then retrying. This piled excessive work onto the file server targeted by the maintenance and gradually ramped up its IO load. The high IO translated into Gitaly timeouts for requests to that server, resulting in many HTTP 502 responses for customers.
Impact & Metrics
- What was the impact of the incident? Slow performance and HTTP 502 responses from GitLab
- Who was impacted by this incident? Everyone
Relevant dashboards:
- https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&from=1549028551843&to=1549037954496
- https://dashboards.gitlab.net/d/000000204/gitaly-metrics-per-host?orgId=1&var-fqdn=file-23-stor-gprd.c.gitlab-production.internal&var-job=gitaly&from=1549028637790&to=1549050237790
- Another fun set of charts: https://dashboards.gitlab.net/d/WOtyonOiz/general-triage-service?orgId=1&var-prometheus_ds=Global&var-environment=gprd&var-type=gitaly&var-sigma=2&var-component_availability=All&var-component_ops=All&var-component_apdex=All&var-component_errors=All&from=1549026945355&to=1549041345355
Detection & Response
- How was the incident detected? Customer reports and our beta monitoring.
- Did alerting work as expected? Not entirely. While we did get alerts for high rates of 500s, they did not come through until the problem had already built up significant momentum, and we did not receive any pages.
- How long did it take from the start of the incident to its detection? Over 1 hour.
- How long did it take from detection to remediation? Over 2 hours.
Timeline
2019-02-01
- 12:53 UTC - maintenance on file server 23 began - command executed: `gitlab-rails runner /tmp/storage_rebalance.rb --current-file-server nfs-file23 --target-file-server nfs-file27 --move-amount 2560 --dry-run false --wait 300 | tee nfs-file23.log`
- 15:02 UTC - first page regarding high 5xx counts - https://gitlab.slack.com/archives/C12RCNXK5/p1549033324163300
- 15:03 UTC - first alert regarding Gitaly error counts on file server 23 - https://gitlab.slack.com/archives/C12RCNXK5/p1549033383163500
- 15:30 UTC - on-call engineer reaches out to the engineer performing the maintenance
- 15:31 UTC - the maintenance operation was halted before proceeding further
- 15:34 UTC - an estimated 31 project moves were in flight; every process related to this maintenance work was `renice`'d to lower the IO impact on the file server
- 15:38 UTC - file server 23 started to show signs of recovery as `git` processes completed
- 15:58 UTC - file server 23 returned to an acceptable SLO for the first time since the degradation began
- 16:41 UTC - file server 23 remained above the SLO
Root Cause Analysis
A seemingly routine maintenance item, which had run for the entire previous week without incident, resulted in a failure. The script driving it queried for projects to be moved and kept scheduling work without taking into account the number of active jobs, nor the failures and retries of existing jobs. These jobs were responsible for telling a file server to move data from one server to another. Some jobs ran long enough to be marked as failed due to timeout, even though the underlying mechanisms were still running on the file server. When such a job was retried, it spun up another process on that file server, so a single project move could end up as three total operations on the same project. When maintenance was stopped, it was discovered that 31 projects were scheduled to move off of this file server, and all of them were still processing data on the repository before completing the move.
This process of moving data invokes the `git upload-pack` command, which is very IO intensive; for large repositories it can take a while. The script created for this maintenance was built only to enqueue the work and let Sidekiq schedule it. Neither the maintenance script nor the operator had any visibility into the progress or failure rate of these repository migrations.
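To make the failure mode concrete, here is a minimal Ruby sketch of the scheduling pattern described above. It is not the actual storage_rebalance.rb or GitLab worker code; the worker class and helper names are hypothetical, and only the fixed `--wait` interval and the timeout/retry behaviour are taken from the incident.

```ruby
require 'sidekiq'

# Hypothetical worker standing in for GitLab's real repository-move job.
class ProjectMoveWorker
  include Sidekiq::Worker

  def perform(project_id, target_storage)
    # The move ultimately drives `git upload-pack` on the source file server.
    # If this call outlives the job's timeout, Sidekiq records a failure and
    # schedules a retry while the original git process keeps running -- so
    # each retry stacks another IO-heavy operation on the same repository.
    move_repository(project_id, target_storage)
  end

  private

  # Placeholder for the real move logic.
  def move_repository(project_id, target_storage); end
end

# Scheduling loop as described above: enqueue a move, sleep a fixed interval
# (the --wait 300 option), and never inspect the state of previously
# scheduled jobs -- active, failed, or retrying.
candidate_project_ids = [] # placeholder for the query selecting older, larger repos
candidate_project_ids.each do |project_id|
  ProjectMoveWorker.perform_async(project_id, 'nfs-file27')
  sleep 300
end
```

Because the scheduler's only back-pressure was the fixed sleep, retries of timed-out jobs were invisible to it, which is how 31 concurrent moves were able to accumulate on file server 23.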
What went well
- The maintenance operator was able to quickly gather a list of processes tied to project moves and `renice` them to help the server along.
- We chose not to kill any running processes in order to prevent data corruption.
- The script was designed to prevent overflowing the queue, which avoided the need for manual intervention on future work that would otherwise have been queued.
Questions
- Is there a potential network bottleneck?
- Can we develop a better understanding of how our Gitaly SLOs impact overall web server performance?
What can be improved
- No one was notified. This situation was not treated as an incident when it should have been, leaving our support team and our users without any status information.
- SREs need a better understanding of the impact of project migrations; this was a new failure scenario that we had not run into previously.
- Documentation for how project moves work doesn't discuss what mechanisms are used to perform them; learning that `git upload-pack` was running multiple times on the file servers was new information.
- Visibility into this process overall is highly limited:
  - This encompasses logging
  - Viewing Sidekiq metrics
- We need to discuss better ways of garnering the attention of the correct parties
- https://gitlab.slack.com/archives/C101F3796/p1549034830974100
- https://gitlab.slack.com/archives/C101F3796/p1549042978985800
- If members think there might be an incident, we need to ensure we mark the correct urgency on potential issues.
Corrective actions
- Improvements should be made to the script that was performing these project repo migrations (one possible throttling approach is sketched after this list) - https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6157
- Better handling of failure scenarios inside of GitLab - https://gitlab.com/gitlab-org/gitlab-ee/issues/9563
- Improve this process's visibility in GitLab - https://gitlab.com/gitlab-org/gitlab-ee/issues/9534
- Improve Gitaly process monitoring/niceness - https://gitlab.com/gitlab-org/gitlab-ee/issues/9606
- Alert Improvements - https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6158
- Communication Improvements - https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6160
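As one possible direction for the script improvement above (a sketch only; the linked issue defines the actual scope), the rebalance script could throttle itself on Sidekiq's queued, busy, and retrying job counts rather than a fixed sleep. The queue and worker names below are assumptions, not the real configuration.

```ruby
require 'sidekiq/api'

# Hypothetical names; the real queue and worker are defined by GitLab, not here.
MOVE_QUEUE    = 'repository_move'
MOVE_CLASS    = 'ProjectMoveWorker'
MAX_IN_FLIGHT = 2

# Count move jobs that are queued, currently running, or awaiting retry.
# (Sidekiq::Workers yields process id, thread id, and a work hash.)
def moves_in_flight
  queued  = Sidekiq::Queue.new(MOVE_QUEUE).size
  busy    = Sidekiq::Workers.new.count { |_pid, _tid, work| work['queue'] == MOVE_QUEUE }
  retries = Sidekiq::RetrySet.new.count { |job| job.klass == MOVE_CLASS }
  queued + busy + retries
end

# Enqueue a move only once in-flight work has drained below the cap,
# instead of sleeping a fixed interval regardless of progress.
def schedule_move(project_id, target_storage)
  sleep 30 while moves_in_flight >= MAX_IN_FLIGHT
  Sidekiq::Client.push('queue' => MOVE_QUEUE,
                       'class' => MOVE_CLASS,
                       'args'  => [project_id, target_storage])
end
```

Gating on busy and retry counts would have surfaced the timed-out-but-still-running moves that the fixed sleep interval hid from the operator.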