2022-06-08: RepositoryUpdateMirrorWorker seems to be pausing for long periods. Multiple reports of repo mirroring lag
Incident Roles
The DRI for this incident is the incident issue assignee, see roles and responsibilities.
Roles when the incident was declared:
- Incident Manager (IMOC): @ahegyi
- Engineer on-call (EOC): @mwasilewski-gitlab, @anganga
Current Status
We're investigating unusually high delays in our repository mirroring feature. We have identified possible causes and are working on a fix.
Summary for CMOC notice / Exec summary:
- Customer Impact: Some mirror pulling jobs are seeing long delays (~42 minutes).
- Service Impact: Sidekiq (worker: RepositoryUpdateMirrorWorker) - Repo Mirroring feature affected
- Impact Duration: 2022-06-08 12:56 UTC - end time UTC ( duration in minutes )
- Root cause: Delays in our repository update worker background job (RepositoryUpdateMirrorWorker)
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
- GitLab.com Latest Updates
All times UTC.
Before the incident
2022-05-23
- An MR was merged which possibly contributed to the incident.
After the incident
2022-06-08
- 12:56 - @bprescott_ declares incident in Slack.
- 13:10 - Identified an MR that might mitigate the issue.
- 14:19 - Found a few cases where the worker that schedules the project mirroring jobs timed out.
- 15:33 - @mwasilewski-gitlab provided an overview of the UpdateAllMirrorsWorker worker and how timeouts lead to bursts in scheduling #7223 (comment 976083432) (see the sketch below the timeline).
- 16:22 - @jeromezng engaged the dev on-call, which was accepted by @brytannia https://gitlab.slack.com/archives/CLKLMSUR4/p1654705343632479
- 16:31 - @nhoppe1 @nnelson @mchacon3 found that the spike in project mirror updates overdue has a strong correlation with the feature flag ci_variable_for_group_gitlab_deploy_token #7223 (comment 976164767). => @nhoppe1 later concluded this is likely not the issue #7223 (comment 976514719)
- 16:52 - @brytannia identified an MR which seems to be causing the query timeout #7223 (comment 976196198)
- 18:30 - @nnelson identified that there has been a very large increase in Scheduled Pull Mirror jobs, which started a couple of weeks ago around 2022-05-20. The increase goes from ~30 to ~1400 Scheduled Mirrors. These Pull Mirror jobs still appear to be processed, but they are delayed (a customer reported a delay of ~42 min), likely due to the large number of jobs. #7223 (comment 976450800)
- 18:38 - @brytannia identified a potential area of the code which may be causing the increase in Scheduled Pull Mirror jobs #7223 (comment 976477817)
- 19:00 - @jeromezng engaged the Create Source Code team, which has domain expertise in this area, to investigate. @dsatcher identified @kerrizor to assist with this issue. The next step should be to get these two MRs deployed to production: gitlab-org/gitlab!89564 (merged) and gitlab-org/gitlab!89501 (merged). These will hopefully result in a decrease in Scheduled Pull Mirror jobs along with a decrease in Project Mirror Updates Overdue.
- 21:00 - A broken spec in the master branch is blocking the merge of !89564.
- 23:00 - The master branch has been fixed; the MR branch has been rebased and is running through CI.
2022-06-09
- 01:23 - gitlab-org/gitlab!89564 (merged) is merged.
- 07:26 - After making it to canary, the fix is promoted. Rolling out to staging now.
- 07:56 - Rolling out to prod now.
- 09:39 - Deployment to prod has completed.
- 09:53 - Recovery is observed with gitlab-org/gitlab!89501 (merged) as the fix. Pull mirrors are processed every hour, therefore a further 60-minute observation period is required.
- 11:11 - Incident is marked as mitigated. We continue to observe the pull mirror processes.
- 15:49 - Incident marked as resolved, as it has been 4 hours since the fix was deployed without any further related incidents.
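As a rough illustration of the burst behaviour noted in the 15:33 entry, here is a hypothetical Ruby sketch (not the actual UpdateAllMirrorsWorker code): a scheduling loop with a fixed time budget that runs out leaves the remaining due mirrors pending, and the next run then schedules them all at once on top of the newly due ones. The class, constant, and helper names are assumptions for illustration only.

```ruby
# Hypothetical sketch of a scheduling loop with a time budget. If the loop hits its
# deadline mid-batch, the leftover mirrors roll over and are scheduled in a burst on
# the next run. Not the actual UpdateAllMirrorsWorker implementation.
class MirrorScheduler
  TIME_BUDGET_SECONDS = 5 * 60 # assumed budget for one scheduling run

  def initialize
    @pending = [] # mirrors that are due but not yet scheduled
  end

  def add_due_mirrors(mirrors)
    @pending.concat(mirrors)
  end

  # Returns the number of mirrors scheduled in this run.
  def run
    deadline = Time.now + TIME_BUDGET_SECONDS
    scheduled = 0

    while @pending.any? && Time.now < deadline
      schedule_update(@pending.shift)
      scheduled += 1
    end

    # Anything still in @pending stays due and is scheduled in a burst next run,
    # together with whatever has become due in the meantime.
    scheduled
  end

  private

  def schedule_update(mirror)
    # In production this would enqueue a Sidekiq job for the mirror update.
  end
end
```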
Create related issues
Use the following links to create issues related to this incident if additional work needs to be completed after it is resolved:
Takeaways
- ...
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- ...
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary.
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section.
- Fill out the relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers) Projects using the repository mirroring feature.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...) Delayed repository mirroring.
- How many customers were affected? Cannot be assessed accurately.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
What were the root causes?
ProjectImportScheduleWorker moves projects to the scheduled state, but RepositoryUpdateMirrorWorker thinks they are in the finished state, so it doesn't pick them up. As a result, scheduled project mirrors keep piling up until StuckImportJob pushes them back to the failed state, and then the process repeats. With every new iteration we accumulate more and more abandoned scheduled mirrors.
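The self-contained Ruby sketch below illustrates this mismatch under stated assumptions: the attribute names (import_state, mirror_status), the Struct-based Project, and the worker bodies are illustrative stand-ins, not the actual GitLab implementation.

```ruby
# Hypothetical sketch of the state mismatch described above.
Project = Struct.new(:id, :import_state, :mirror_status)

class ProjectImportScheduleWorker
  # The scheduler flips the import state machine to :scheduled and hands off the update.
  def perform(project)
    project.import_state = :scheduled
    RepositoryUpdateMirrorWorker.new.perform(project)
  end
end

class RepositoryUpdateMirrorWorker
  # The update worker consults a *different* view of the state, which still reads
  # :finished, so it treats the mirror as already done and never runs the update.
  def perform(project)
    return if project.mirror_status == :finished

    project.import_state = :finished # the real mirror update would happen here
  end
end

class StuckImportJob
  # Periodically sweeps mirrors that sat in :scheduled for too long back to :failed,
  # after which they are rescheduled and the cycle repeats with more abandoned mirrors.
  def perform(projects)
    projects.each do |p|
      p.import_state = :failed if p.import_state == :scheduled
    end
  end
end

project = Project.new(1, :finished, :finished)
ProjectImportScheduleWorker.new.perform(project)
puts project.import_state # => scheduled, and it stays there until StuckImportJob runs
```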
Incident Response Analysis
- How was the incident detected? Customer reports and Zendesk tickets about delayed mirroring.
- How could detection time be improved? Idea: monitor the scheduling lag and raise an alert when it exceeds a threshold (see the sketch after this list).
- How was the root cause diagnosed? Mostly via Grafana dashboards and Kibana.
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
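A minimal sketch of the detection idea above, written as a Sidekiq server middleware that measures how long RepositoryUpdateMirrorWorker jobs wait in the queue and warns when the lag crosses a threshold. The threshold value, the worker-name filter, and logging as the alert sink are assumptions; it also assumes a Sidekiq version where the job payload's enqueued_at field is a Unix timestamp in seconds.

```ruby
require 'sidekiq'

# Server middleware that flags excessive queueing lag for mirror update jobs.
class MirrorSchedulingLagMonitor
  LAG_THRESHOLD_SECONDS = 15 * 60 # assumed threshold: warn once jobs wait > 15 minutes

  def call(worker, job, queue)
    if job['class'] == 'RepositoryUpdateMirrorWorker' && job['enqueued_at']
      lag = Time.now.to_f - job['enqueued_at']

      if lag > LAG_THRESHOLD_SECONDS
        # In production this would emit a metric or page the EOC instead of logging.
        Sidekiq.logger.warn("mirror scheduling lag #{lag.round}s on queue #{queue}")
      end
    end

    yield
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add MirrorSchedulingLagMonitor
  end
end
```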
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
What went well?
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)