2022-06-08: RepositoryUpdateMirrorWorker seems to be pausing for long periods. Multiple reports of repo mirroring lag
Incident Roles
The DRI for this incident is the incident issue assignee, see roles and responsibilities.
Roles when the incident was declared:
- Incident Manager (IMOC): @ahegyi
- Engineer on-call (EOC): @mwasilewski-gitlab, @anganga
Current Status
We're investigating unusually high delays in our repository mirroring feature. We have identified possible causes and are working on a fix.
Summary for CMOC notice / Exec summary:
- Customer Impact: Some mirror pulling jobs are seeing long delays (~42 minutes).
- Service Impact: Sidekiq (worker: RepositoryUpdateMirrorWorker) - Repo Mirroring feature affected
- Impact Duration: 2022-06-08 12:56 UTC - end time UTC ( duration in minutes )
- Root cause: Delays in our repository update worker background job (RepositoryUpdateMirrorWorker)
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
- GitLab.com Latest Updates
All times UTC.
Before the incident
2022-05-23
- An MR was merged which possibly contributed to the incident.
After the incident
2022-06-08
- 12:56 - @bprescott_ declares incident in Slack.
- 13:10 - Identified an MR that might mitigate the issue.
- 14:19 - Found a few cases where the worker that schedules the project mirroring jobs timed out.
- 15:33 - @mwasilewski-gitlab provided an overview of the UpdateAllMirrorsWorker worker and how timeouts lead to bursts in scheduling #7223 (comment 976083432) (see the sketch below the timeline).
- 16:22 - @jeromezng engaged the dev on-call, which was accepted by @brytannia https://gitlab.slack.com/archives/CLKLMSUR4/p1654705343632479
- 16:31 - @nhoppe1 @nnelson @mchacon3 found that the spike in project mirror updates overdue has a strong correlation with the feature flag ci_variable_for_group_gitlab_deploy_token #7223 (comment 976164767). => @nhoppe1 later concluded this is likely not the issue #7223 (comment 976514719)
- 16:52 - @brytannia identified an MR which seems to be causing the query timeout #7223 (comment 976196198)
- 18:30 - @nnelson identified that there has been a very large increase in Scheduled Pull Mirror jobs, which started a couple of weeks ago around 2022-05-20. The increase goes from ~30 to ~1400 Scheduled Mirrors. These Pull Mirror jobs still appear to be processed, but they are delayed (a customer reported a delay of ~42 min), likely due to the large number of jobs. #7223 (comment 976450800)
- 18:38 - @brytannia identified a potential area of the code which may be causing the increase in Scheduled Pull Mirror jobs #7223 (comment 976477817)
- 19:00 - @jeromezng engaged the Create Source Code team, which has domain expertise in this area, to investigate. @dsatcher identified @kerrizor to assist with this issue. The next step should be to get these two MRs deployed to production: gitlab-org/gitlab!89564 (merged) and gitlab-org/gitlab!89501 (merged). These will hopefully result in a decrease in Scheduled Pull Mirror jobs along with a decrease in Project Mirror Updates Overdue.
- 21:00 - A broken spec in the master branch is blocking the merge of !89564.
- 23:00 - The master branch has been fixed; the MR branch has been rebased and is running through CI.
2022-06-09
- 01:23 - gitlab-org/gitlab!89564 (merged) is merged.
- 07:26 - After making it to canary, the fix is promoted. Rolling out to staging now.
- 07:56 - Rolling out to prod now.
- 09:39 - Deployment to prod has completed.
- 09:53 - Recovery is observed with gitlab-org/gitlab!89501 (merged) as the fix. Pull mirrors are processed every hour, therefore a further 60-minute observation period is required.
- 11:11 - Incident is marked as mitigated. We continue to observe the pull mirror processes.
- 15:49 - Incident marked as resolved, as it has been 4 hours since the fix was deployed without any further related incidents.
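As a rough illustration of the burst behaviour noted in the 15:33 entry, here is a hypothetical Ruby sketch (not the actual UpdateAllMirrorsWorker code): a scheduling loop with a fixed time budget that runs out leaves the remaining due mirrors pending, and the next run then schedules them all at once on top of the newly due ones. The class, constant, and helper names are assumptions for illustration only.

```ruby
# Hypothetical sketch of a scheduling loop with a time budget. If the loop hits its
# deadline mid-batch, the leftover mirrors roll over and are scheduled in a burst on
# the next run. Not the actual UpdateAllMirrorsWorker implementation.
class MirrorScheduler
  TIME_BUDGET_SECONDS = 5 * 60 # assumed budget for one scheduling run

  def initialize
    @pending = [] # mirrors that are due but not yet scheduled
  end

  def add_due_mirrors(mirrors)
    @pending.concat(mirrors)
  end

  # Returns the number of mirrors scheduled in this run.
  def run
    deadline = Time.now + TIME_BUDGET_SECONDS
    scheduled = 0

    while @pending.any? && Time.now < deadline
      schedule_update(@pending.shift)
      scheduled += 1
    end

    # Anything still in @pending stays due and is scheduled in a burst next run,
    # together with whatever has become due in the meantime.
    scheduled
  end

  private

  def schedule_update(mirror)
    # In production this would enqueue a Sidekiq job for the mirror update.
  end
end
```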
Create related issues
Use the following links to create issues related to this incident if additional work needs to be completed after it is resolved:
Takeaways
- ...
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- ...
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary.
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section.
- Fill out the relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers) Projects using the repository mirroring feature.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...) Delayed repository mirroring.
- How many customers were affected? Cannot be assessed accurately.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
What were the root causes?
ProjectImportScheduleWorker moves projects to the scheduled state, but RepositoryUpdateMirrorWorker thinks they are in the finished state, so it doesn't pick them up. As a result, scheduled project mirrors keep piling up until StuckImportJob pushes them back to the failed state, and then the process repeats. With every new iteration we accumulate more and more abandoned scheduled mirrors.
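The self-contained Ruby sketch below illustrates this mismatch under stated assumptions: the attribute names (import_state, mirror_status), the Struct-based Project, and the worker bodies are illustrative stand-ins, not the actual GitLab implementation.

```ruby
# Hypothetical sketch of the state mismatch described above.
Project = Struct.new(:id, :import_state, :mirror_status)

class ProjectImportScheduleWorker
  # The scheduler flips the import state machine to :scheduled and hands off the update.
  def perform(project)
    project.import_state = :scheduled
    RepositoryUpdateMirrorWorker.new.perform(project)
  end
end

class RepositoryUpdateMirrorWorker
  # The update worker consults a *different* view of the state, which still reads
  # :finished, so it treats the mirror as already done and never runs the update.
  def perform(project)
    return if project.mirror_status == :finished

    project.import_state = :finished # the real mirror update would happen here
  end
end

class StuckImportJob
  # Periodically sweeps mirrors that sat in :scheduled for too long back to :failed,
  # after which they are rescheduled and the cycle repeats with more abandoned mirrors.
  def perform(projects)
    projects.each do |p|
      p.import_state = :failed if p.import_state == :scheduled
    end
  end
end

project = Project.new(1, :finished, :finished)
ProjectImportScheduleWorker.new.perform(project)
puts project.import_state # => scheduled, and it stays there until StuckImportJob runs
```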
Incident Response Analysis
- How was the incident detected? Customer reports and Zendesk tickets about delayed mirroring.
- How could detection time be improved? Idea: monitor the scheduling lag and raise an alert when it exceeds a threshold (see the sketch after this list).
- How was the root cause diagnosed? Mostly via Grafana dashboards and Kibana.
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
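A minimal sketch of the detection idea above, written as a Sidekiq server middleware that measures how long RepositoryUpdateMirrorWorker jobs wait in the queue and warns when the lag crosses a threshold. The threshold value, the worker-name filter, and logging as the alert sink are assumptions; it also assumes a Sidekiq version where the job payload's enqueued_at field is a Unix timestamp in seconds.

```ruby
require 'sidekiq'

# Server middleware that flags excessive queueing lag for mirror update jobs.
class MirrorSchedulingLagMonitor
  LAG_THRESHOLD_SECONDS = 15 * 60 # assumed threshold: warn once jobs wait > 15 minutes

  def call(worker, job, queue)
    if job['class'] == 'RepositoryUpdateMirrorWorker' && job['enqueued_at']
      lag = Time.now.to_f - job['enqueued_at']

      if lag > LAG_THRESHOLD_SECONDS
        # In production this would emit a metric or page the EOC instead of logging.
        Sidekiq.logger.warn("mirror scheduling lag #{lag.round}s on queue #{queue}")
      end
    end

    yield
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add MirrorSchedulingLagMonitor
  end
end
```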
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
What went well?
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)