Investigate: worker with `sticky` data consistency reads stale data
- Sentry issue: #364635 (closed)
- Incident: gitlab-com/gl-infra/production#7223 (closed)
Description
- MR Switch to `sticky` data consistency for Reposit... (!87995 - merged) changed data consistency from `always` to `sticky` for `RepositoryUpdateMirrorWorker`
- Feature flag from this MR was globally enabled on 2022-05-24
- On the same date, we observed an increased number of `StuckImportJob` errors
- `RepositoryUpdateMirrorWorker` recorded logs with an error description - https://log.gprd.gitlab.net/goto/b1c8e040-e7ed-11ec-8656-f5f2137823ba
- On 2022-06-09, we reverted data consistency from `sticky` to `always` -> the number of errors significantly decreased
Theory
The simplified chain of events to pull the repository mirror:
1. `UpdateAllMirrorsWorker` runs regularly by Cron
2. It spawns `ProjectImportScheduleWorker`s for each project that requires pull mirror to be updated
3. `ProjectImportScheduleWorker` changes status of the project to `scheduled`
4. After that, we create a `RepositoryUpdateMirrorWorker` to perform the update
5. `RepositoryUpdateMirrorWorker` checks the status of the project before it starts processing
6. `RepositoryUpdateMirrorWorker` cannot start the update because the project has a `finished` status
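The chain above can be reproduced with a toy model. This is only a sketch under stated assumptions: `ToyDatabase` and `start_mirror_update` are hypothetical names that do not exist in GitLab, and replication is modeled as an explicit step instead of an asynchronous stream.

```ruby
# Toy model: one primary and one replica. The replica only sees a write
# after `replicate!` is called (real replication is asynchronous).
class ToyDatabase
  attr_reader :primary, :replica

  def initialize(status)
    @primary = { import_status: status }
    @replica = @primary.dup
  end

  # Writes always go to the primary.
  def write(status)
    @primary[:import_status] = status
  end

  # Copy the current primary state to the replica.
  def replicate!
    @replica = @primary.dup
  end
end

# Hypothetical stand-in for the status check that runs before the
# mirror update starts processing.
def start_mirror_update(import_status)
  unless import_status == 'scheduled'
    return "Project was in an inconsistent state: #{import_status}"
  end

  'started'
end

db = ToyDatabase.new('finished') # state left by the previous mirror update
db.write('scheduled')            # the status change made before the update job is created

# The replica has not replicated yet, so it still returns `finished`.
stale_result = start_mirror_update(db.replica[:import_status])
fresh_result = start_mirror_update(db.primary[:import_status])

puts stale_result
puts fresh_result
```

Reading from the primary, or from the replica after `db.replicate!`, lets the update start. Only the read from the lagging replica produces the inconsistent-state message.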
I think that `RepositoryUpdateMirrorWorker` somehow reads stale data from a replica that has not yet received the `scheduled` update (from step 3). It happens in roughly 1.2% of cases.
We see that problem in logs: 'Project was in an inconsistent state: finished'.
After we restored data consistency for `RepositoryUpdateMirrorWorker` to `always`, this problem almost disappeared.
A possible reason for this behavior is that we read data from a replica that is not up to date. However, this should not happen: with `sticky`, the job is expected to fall back to the primary when the replica has not caught up. Related code: https://gitlab.com/gitlab-org/gitlab/blob/8c4b269470e817269375f0d972d7eb5aca13566d/lib/gitlab/database/load_balancing/sidekiq_server_middleware.rb#L52
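To make the expected behavior concrete, here is a minimal sketch of the kind of check a `sticky` job relies on: compare the write-ahead log position recorded for the job with the position the replica has replayed. The helper names (`lsn_to_i`, `database_for_job`) and the LSN values are assumptions for illustration and are not taken from the linked middleware.

```ruby
# Convert a PostgreSQL LSN such as "0/16B3748" into a comparable integer.
# An LSN is two hexadecimal numbers: the high and low 32 bits.
def lsn_to_i(lsn)
  high, low = lsn.split('/')
  (high.hex << 32) + low.hex
end

# Hypothetical decision helper: pick the database a `sticky` job reads from.
# The replica is safe only if it has replayed at least up to the location
# recorded for the job.
def database_for_job(job_wal_location:, replica_replay_location:)
  if lsn_to_i(replica_replay_location) >= lsn_to_i(job_wal_location)
    :replica
  else
    :primary
  end
end

caught_up = database_for_job(
  job_wal_location: '0/16B3748',
  replica_replay_location: '0/16B3800'
)
lagging = database_for_job(
  job_wal_location: '0/16B3748',
  replica_replay_location: '0/16B3000'
)

puts caught_up
puts lagging
```

Under this model, a stale read would mean the comparison passed for a replica that did not yet contain the `scheduled` write, which is the case the investigation needs to explain.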
Edited by Vasilii Iakliushin