2020-11-11: file-cny-01-stor-gprd violating its apdex SLO - gitlab-org/gitlab pre-clone script related
Summary
The Gitaly canary server has an elevated error rate, and 0% of PostUploadPack requests have acquired the rate limit lock within a minute. This only affects GitLab's own repositories.
Timeline
All times UTC.
2020-11-11
- 18:53 - alex declares incident in Slack.
2020-11-12
- 08:26 - Andrew mentions in Slack that the pre-clone script could be the cause: https://gitlab.slack.com/archives/C01EBK5A6AJ/p1605169590017900
- 10:20 - Ahmad, Alberto, Sebastian, and Sean join the incident call and ask for assistance from Gitaly and Engineering Productivity.
- 10:33 - Albert and Mark from Engineering Productivity point to an MR that changed the pre-clone script: gitlab-org/gitlab!47220 (merged)
- 10:47 - Mark updates the pre-clone script and we wait for new CI jobs to pick this up.
- 11:04 - we see this having a positive effect: #3013 (comment 446246020)
- 11:27 - we mark the incident as mitigated.
Corrective Actions
- gitlab-org&1692
- This would let us remove the pre-clone script, and it would also let us see the benefits for other heavy CI users without manual configuration needed on their part.
- No estimated date.
- Owned by Gitaly.
Summary
For about 17 hours (from around 2020-11-11 18:00 UTC to around 2020-11-12 11:00 UTC), we saw extremely high load on the main GitLab Rails repository (gitlab-org/gitlab), which fired multiple alerts and led to pipelines failing because jobs could not check out the repository. This was caused by a change to our CI pipeline configuration that made us perform a full clone of the repository far more often than usual. It was mitigated by fixing the CI configuration.
- Service(s) affected: git
- Team attribution: Engineering Productivity, Gitaly
- Minutes downtime or degradation: approximately 17 hours of degradation
More details
The root cause was a change to our repo cache: gitlab-org/gitlab!47220 (merged). To avoid putting too much pressure on the repository, we have a scheduled job that generates a clone of the repository every two hours and stores it in object storage. When a CI job starts, it fetches that tarball from object storage and performs a `git fetch` to collect the latest commits. This is much more efficient than running a full `git clone` each time.
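As a rough illustration of the mechanism, a pre-clone script along the following lines (the object storage URL and paths are hypothetical; the real script lives in the project's CI/CD settings, see the documentation linked later in this review) unpacks the cached tarball into the build directory so the runner's subsequent `git fetch` only has to pull recent commits:

```shell
# Sketch of a pre-clone script, with a hypothetical object storage URL.
# It downloads the most recent repository tarball (regenerated every two
# hours by a scheduled job) and extracts it into the build directory, so
# that the runner's usual `git fetch` only needs to pick up new commits.
set -eo pipefail

CACHE_URL="https://storage.example.com/git-repo-cache/gitlab-org/gitlab.tar.gz"  # hypothetical

echo "Downloading cached repository archive..."
if curl --fail --silent --show-error --output /tmp/gitlab.tar.gz "$CACHE_URL"; then
  echo "Extracting archive into $CI_BUILDS_DIR..."
  cd "$CI_BUILDS_DIR"
  tar -xzf /tmp/gitlab.tar.gz
  rm -f /tmp/gitlab.tar.gz
else
  echo "Cache unavailable; the runner will fall back to a full clone."
fi
```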
The MR in question allowed us to switch out the full clone in object storage for a shallow clone, to reduce the setup time further. This had some post-merge steps, but they weren't strictly required. The MR was designed to 'just work' with either the shallow clone or the full clone.
Unfortunately, due to a bug in the MR, the full clone was generated with the wrong path: gitlab-org/gitlab!47220 (comment 446298624). This meant that the CI job did not 'see' the repository we downloaded from object storage and performed a full `git clone` each time. Performing the post-merge steps mitigated the incident because the shallow clone was generated with the correct path.
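To make the failure mode concrete: the cached archive only helps if it unpacks to the exact path the runner later checks. A hypothetical sketch of the mismatch (the directory names are illustrative, not the actual ones from the MR):

```shell
# Hypothetical illustration of the path bug: the scheduled job packed the
# repository under a different directory name than the one the runner
# expects, so the extracted files are never used and a full clone happens.
cd "$CI_BUILDS_DIR"
tar -xzf /tmp/gitlab.tar.gz   # unpacks to e.g. ./gitlab-org/gitlab-full (wrong path)

if [ -d "$CI_PROJECT_DIR/.git" ]; then
  echo "Cached repository found; only 'git fetch' is needed."
else
  echo "No repository at $CI_PROJECT_DIR; runner performs a full 'git clone'."
fi
```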
Metrics
The most relevant metrics were saturation on the Gitaly side: https://dashboards.gitlab.net/d/gitaly-main/gitaly-overview?from=now-7d&orgId=1&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-sigma=2&var-stage=cny
Customer Impact
- Who was impacted by this incident?
- Only people trying to work on the main GitLab repository.
- What was the customer experience during the incident?
- No change.
- How many customers were affected?
- Only internal users.
Incident Response Analysis
- How was the event detected?
- Alerts firing.
- How could detection time be improved?
- No recommendation.
- How did we reach the point where we knew how to mitigate the impact?
- We relied on tribal knowledge that we use a pre-clone script to improve performance here: https://docs.gitlab.com/ee/user/gitlab_com/#pre-clone-script
- How could time to mitigation be improved?
- This script is not widely known, yet it is extremely important. Purely on the mitigation side, having this knowledge available more widely would have helped, although it is a very specific piece of information.
Post Incident Analysis
- How was the root cause diagnosed?
- Code review of the MR in question, followed by inspection of job logs and the generated tarball itself.
- How could time to diagnosis be improved?
- No recommendation.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change?
5 Whys
- Clones of GitLab were taking longer and sometimes failing in CI, why?
- The Gitaly server was saturated.
- Why was it saturated?
- We were performing full clones much more frequently than usual.
- Why were we performing full clones more often?
- There was a bug in a change we made to fetch a tarball of a shallow clone instead of a full clone.
- Why did we not find the bug sooner?
- The bug is in a CI script and it's hard to test those without running them.
- We didn't realise that this change caused the issue.
- Why did we not realise this change caused the issue?
- Most incidents like this are due to deployed code. Because this is CI configuration, it took effect as soon as it was merged, so we looked at the wrong set of MRs.
Lessons Learned
- The CI pre-clone script is very important to the performance of the hosting of our main repo and therefore our development pipelines.
- Relying on project-specific configuration for a general concern (efficient fetching of repos for CI jobs) leads to more brittle solutions.
Corrective actions:
- CI config changes should be communicated each time in Slack to increase the likelihood of linking an incident to a CI config change: gitlab-org/gitlab#282433 (closed)
- Improve the communication guidelines around pipeline changes, gitlab-com/www-gitlab-com!68320 (merged).