2022-06-20: Canary web error rate is elevated during deploy
Incident Roles
The DRI for this incident is the incident issue assignee; see roles and responsibilities.
Roles when the incident was declared:
- Incident Manager (IMOC): @jeromezng
- Engineer on-call (EOC): @cmcfarland
Current Status
There was a software change in Gitaly that caused elevated web error rates on file-cny-01, which made the VM slow and unusable. We stopped the for-each-ref processes that were causing this and restarted the gitaly service. We've also reverted the software change in Gitaly: gitlab-org/gitlab!90643 (closed). This revert MR is now working its way into the next release package. For the time being, we've locked gprd-cny from receiving new deploys until we can confirm that the revert MR has made it into the next release package.
Deployments remain blocked for the time being. The incident is mitigated but not marked resolved, as we're waiting for deployments to be unblocked.
Summary for CMOC notice / Exec summary:
- Customer Impact: Any use of projects on the Canary Gitaly VM was slow or returned errors.
- Service Impact: Service::Gitaly
- Impact Duration: 16:53 - 18:56 UTC (123 minutes)
- Root cause: RootCause::Software-Change
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
- Gitlab.com Latest Updates
All times UTC.
2022-06-20
- 16:51 - The new gitaly package was installed (15.1.202206201420-fdbd8d2381d.0bd846e0033). This was confirmed to be the issue here: #7284 (comment 998252877)
- 16:53 - file-cny-01 CPU load began to climb to 100%
- 17:27 - @cmcfarland declares incident in Slack.
- 17:33 - @cmcfarland drained Canary in GPRD
- 18:09 - @skarbek initiated rollback of Gitaly on Canary
- 18:45 - @stanhu declared a duplicate incident #7285 (closed)
- 18:56 - file-cny-01 returns to good working condition
- 18:59 - @cmcfarland bumped the incident severity from S3 to S2 as this was deployment blocking #7284 (comment 998241973)
- 19:25 - @stanhu created an MR to revert the Gitaly changes #7284 (comment 998257333)
- 19:25 - @cmcfarland marked the incident as mitigated #7284 (comment 998257715)
- 19:34 - @skarbek locked gprd-cny from receiving new deploys until we can confirm that the revert MR gitlab-org/gitlab!90643 (closed) has made it into the next release package #7284 (comment 998264923)
Create related issues
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
Takeaways
- ...
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- localrepo: Speed up calculating size for repo with excluded alternates
- Deprecate reading external gitconfig to have a single source of truth for the Git configuration
- Extend e2e tests for repository size to cover pools
- Allow benchmarks against differently shaped repositories
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Any user of GitLab.com that was trying to work with a project residing on the file-cny-01 Gitaly VM.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - 5xx errors, timeouts, and slow/long-running requests via web, API, or git.
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Based on the information below, about 3% of the requests for the gitlab-org/gitlab project were subject to a slowdown of greater than 5 seconds. It's not easy to know how many of the projects on file-cny-01 were affected, but each one may have seen about 3% of its requests impacted by this incident.

Requests to gitlab-org/gitlab during the window that took longer than 5 seconds:

Cloudflare Origin Status Code Errors:
What were the root causes?
- A misconfiguration of the Git client that had existed in Omnibus for quite a while was inadvertently fixed. The fix caused us to start correctly computing repository sizes for repositories that are part of an object pool. This new computation hit an edge case in Git that caused us to burn CPU.
- Gitaly didn't correctly restrict all fetches across repos to ignore alternate refs.
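To make the mechanism concrete, below is a minimal sketch, assuming a size check roughly equivalent to `git rev-list --objects --all --not --alternate-refs --disk-usage` (this is our illustration, not Gitaly's actual code; the function name and repository path are made up). The `--alternate-refs` option tells Git to enumerate the refs of each alternate (here, the object pool), which Git does by default by spawning `git for-each-ref` in the pool repository; those are the processes that lingered on file-cny-01:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

// repoSizeExcludingPool is an illustrative sketch of a size check that
// counts only the repository's own objects. "--not --alternate-refs"
// subtracts everything reachable from the alternate's (pool's) refs; to
// enumerate those refs, git by default runs "git for-each-ref" inside the
// pool repository.
func repoSizeExcludingPool(repoPath string) (uint64, error) {
	cmd := exec.Command("git", "-C", repoPath,
		"rev-list", "--objects", "--all",
		"--not", "--alternate-refs", // exclude objects reachable from the pool
		"--disk-usage") // print total on-disk size in bytes
	out, err := cmd.Output()
	if err != nil {
		return 0, fmt.Errorf("rev-list: %w", err)
	}
	var size uint64
	if _, err := fmt.Sscanf(strings.TrimSpace(string(out)), "%d", &size); err != nil {
		return 0, fmt.Errorf("parse disk usage: %w", err)
	}
	return size, nil
}

func main() {
	// Illustrative path; any repository whose alternates point at an
	// object pool exercises the for-each-ref spawn described above.
	size, err := repoSizeExcludingPool("/path/to/repo.git")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("size excluding pool objects: %d bytes\n", size)
}
```

The corrective action "localrepo: Speed up calculating size for repo with excluded alternates" listed earlier presumably targets this walk.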
Incident Response Analysis
- How was the incident detected?
  - At first, there were anecdotal reports of 5xx errors; then paging notifications went to the EOC.
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - This took a while because of the nature of the problem. It took some digging through metrics to identify the source as Gitaly rather than a Rails deploy to the Canary fleet. It then took more time to understand that the gitaly process was spawning the for-each-ref processes and that the lingering processes were causing the problem, even across restarts. Even at the end of this incident, the full source of the problem (an Omnibus change) was not yet known.
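As an aside on diagnosability: a quick scan for orphaned git children would have surfaced the lingering processes directly. A rough sketch, assuming Linux and reading `/proc` (this is an illustration, not a tool we ran during the incident):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// List PIDs whose command line mentions "for-each-ref", e.g. git child
// processes that survived a gitaly restart and kept burning CPU.
func main() {
	paths, err := filepath.Glob("/proc/[0-9]*/cmdline")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue // the process exited between the glob and the read
		}
		// /proc/<pid>/cmdline separates arguments with NUL bytes.
		cmdline := strings.ReplaceAll(string(raw), "\x00", " ")
		if strings.Contains(cmdline, "for-each-ref") {
			pid := strings.Split(p, "/")[2]
			fmt.Printf("%s\t%s\n", pid, strings.TrimSpace(cmdline))
		}
	}
}
```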
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes: the new gitaly package installed at 16:51 (see timeline), reverted in gitlab-org/gitlab!90643 (closed).
- What went well?
  - ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)