2022-06-20: Canary web error rate is elevated during deploy
Incident Roles
The DRI for this incident is the incident issue assignee; see roles and responsibilities.
Roles when the incident was declared:
- Incident Manager (IMOC): @jeromezng
- Engineer on-call (EOC): @cmcfarland
Current Status
There was a software change in Gitaly that caused elevated web error rates on file-cny-01, which made the VM slow and unusable. We stopped the for-each-ref processes that were causing this and restarted the gitaly service. We've also reverted the software change in Gitaly: gitlab-org/gitlab!90643 (closed). This revert MR is now working its way into the next release package. For the time being, we've locked gprd-cny from receiving new deploys until we can confirm that the revert MR has made it into the next release package.
Deployments remain blocked for the time being. The incident is mitigated but not marked resolved, as we're waiting for deployments to be unblocked.
Summary for CMOC notice / Exec summary:
- Customer Impact: Any use of projects on the Canary Gitaly VM was slow or returned errors.
- Service Impact: Service::Gitaly
- Impact Duration: 16:53 - 18:56 UTC (123 minutes)
- Root cause: RootCause::Software-Change
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
- Gitlab.com Latest Updates
All times UTC.
2022-06-20
- 16:51 - The new gitaly package was installed (15.1.202206201420-fdbd8d2381d.0bd846e0033). This was confirmed to be the issue here: #7284 (comment 998252877)
- 16:53 - file-cny-01 CPU load began to climb to 100%
- 17:27 - @cmcfarland declares incident in Slack.
- 17:33 - @cmcfarland drained Canary in GPRD
- 18:09 - @skarbek initiated rollback of Gitaly on Canary
- 18:45 - @stanhu declared a duplicate incident #7285 (closed)
- 18:56 - file-cny-01 returns to good working condition
- 18:59 - @cmcfarland bumped the incident severity from S3 to S2 as this was deployment blocking #7284 (comment 998241973)
- 19:25 - @stanhu created an MR to revert the Gitaly changes #7284 (comment 998257333)
- 19:25 - @cmcfarland marked the incident as mitigated #7284 (comment 998257715)
- 19:34 - @skarbek locked gprd-cny from receiving new deploys until we can confirm that the revert MR gitlab-org/gitlab!90643 (closed) has made it into the next release package #7284 (comment 998264923)
Create related issues
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
Takeaways
- ...
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- localrepo: Speed up calculating size for repo with excluded alternates
- Deprecate reading external gitconfig to have a single source of truth for the Git configuration
- Extend e2e tests for repository size to cover pools
- Allow benchmarks against differently shaped repositories
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Any user of GitLab.com that was trying to work with a project residing on the file-cny-01 Gitaly VM.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - 5xx errors, timeouts, and slow/long-running requests via web, API, or git.
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Based on the information below, about 3% of the requests for the gitlab-org/gitlab project were subject to a slowdown of greater than 5 seconds. It's not easy to know how many of the projects on file-cny-01 were affected, but each one may have seen about 3% of its requests impacted by this incident.

Requests to gitlab-org/gitlab during the window that took longer than 5 seconds:

Cloudflare Origin Status Code Errors:
What were the root causes?
- A misconfiguration of the Git client that had existed in Omnibus for quite a while was inadvertently fixed. The fix caused us to start correctly computing repository sizes for repositories that are part of an object pool. This new computation hit an edge case in Git that caused us to burn CPU.
- Gitaly didn't correctly restrict all fetches across repos to ignore alternate refs.
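To make the mechanism concrete, below is a minimal sketch, assuming a size check roughly equivalent to `git rev-list --objects --all --not --alternate-refs --disk-usage` (this is our illustration, not Gitaly's actual code; the function name and repository path are made up). The `--alternate-refs` option tells Git to enumerate the refs of each alternate (here, the object pool), which Git does by default by spawning `git for-each-ref` in the pool repository; those are the processes that lingered on file-cny-01:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

// repoSizeExcludingPool is an illustrative sketch of a size check that
// counts only the repository's own objects. "--not --alternate-refs"
// subtracts everything reachable from the alternate's (pool's) refs; to
// enumerate those refs, git by default runs "git for-each-ref" inside the
// pool repository.
func repoSizeExcludingPool(repoPath string) (uint64, error) {
	cmd := exec.Command("git", "-C", repoPath,
		"rev-list", "--objects", "--all",
		"--not", "--alternate-refs", // exclude objects reachable from the pool
		"--disk-usage") // print total on-disk size in bytes
	out, err := cmd.Output()
	if err != nil {
		return 0, fmt.Errorf("rev-list: %w", err)
	}
	var size uint64
	if _, err := fmt.Sscanf(strings.TrimSpace(string(out)), "%d", &size); err != nil {
		return 0, fmt.Errorf("parse disk usage: %w", err)
	}
	return size, nil
}

func main() {
	// Illustrative path; any repository whose alternates point at an
	// object pool exercises the for-each-ref spawn described above.
	size, err := repoSizeExcludingPool("/path/to/repo.git")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("size excluding pool objects: %d bytes\n", size)
}
```

The corrective action "localrepo: Speed up calculating size for repo with excluded alternates" listed earlier presumably targets this walk.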
Incident Response Analysis
- How was the incident detected?
  - At first, there were anecdotal reports of 5xx errors; then paging notifications went to the EOC.
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - This took a while because of the nature of the problem. It took some digging through metrics to identify the source as Gitaly rather than a Rails deploy to the Canary fleet. It then took more time to understand that the gitaly process was spawning the for-each-ref processes and that the lingering processes were causing the problem, even across restarts. Even at the end of this incident, the full source of the problem (an Omnibus change) was not yet known.
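As an aside on diagnosability: a quick scan for orphaned git children would have surfaced the lingering processes directly. A rough sketch, assuming Linux and reading `/proc` (this is an illustration, not a tool we ran during the incident):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// List PIDs whose command line mentions "for-each-ref", e.g. git child
// processes that survived a gitaly restart and kept burning CPU.
func main() {
	paths, err := filepath.Glob("/proc/[0-9]*/cmdline")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue // the process exited between the glob and the read
		}
		// /proc/<pid>/cmdline separates arguments with NUL bytes.
		cmdline := strings.ReplaceAll(string(raw), "\x00", " ")
		if strings.Contains(cmdline, "for-each-ref") {
			pid := strings.Split(p, "/")[2]
			fmt.Printf("%s\t%s\n", pid, strings.TrimSpace(cmdline))
		}
	}
}
```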
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes: the new gitaly package installed at 16:51 (see timeline), reverted in gitlab-org/gitlab!90643 (closed).
- What went well?
  - ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)