Investigate: worker with `sticky` data consistency reads stale data

Description

  1. The MR Switch to `sticky` data consistency for Reposit... (!87995 - merged) changed the data consistency of RepositoryUpdateMirrorWorker from always to sticky
  2. The feature flag from this MR was globally enabled on 2022-05-24
  3. On the same date, we observed an increased number of StuckImportJob errors
  4. RepositoryUpdateMirrorWorker recorded logs with an error description - https://log.gprd.gitlab.net/goto/b1c8e040-e7ed-11ec-8656-f5f2137823ba
  5. On 2022-06-09, we reverted data consistency from sticky to always, and the number of errors decreased significantly

Theory

The simplified chain of events for pulling a repository mirror:

  1. UpdateAllMirrorsWorker is run regularly by cron
  2. It spawns a ProjectImportScheduleWorker for each project whose pull mirror needs to be updated
  3. ProjectImportScheduleWorker changes status of the project to scheduled
  4. After that, we create a RepositoryUpdateMirrorWorker to perform the update
  5. RepositoryUpdateMirrorWorker checks the status of the project before it starts processing
  6. RepositoryUpdateMirrorWorker cannot start the update because it sees the project in the finished status instead of scheduled
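
The suspected race in steps 3 to 6 can be illustrated with a toy model. This is only a sketch under assumptions: `ToyDatabase`, `update_mirror`, and the `read_from:` parameter are hypothetical names invented for illustration and are not GitLab code. Replication is simulated by an explicit `replicate!` call.

```ruby
# Toy model of a primary and an asynchronously replicated replica.
# All names here are hypothetical; this is not GitLab code.
class ToyDatabase
  def initialize(status)
    @primary = { status: status }
    @replica = { status: status }
  end

  # Writes go to the primary only; the replica is not updated yet.
  def write(status)
    @primary[:status] = status
  end

  # Simulates the replica replaying the primary's changes.
  def replicate!
    @replica = @primary.dup
  end

  def read(from:)
    (from == :primary ? @primary : @replica)[:status]
  end
end

# Steps 5 and 6: the mirror worker refuses to run unless the project is scheduled.
def update_mirror(db, read_from:)
  status = db.read(from: read_from)
  return "Project was in an inconsistent state: #{status}" unless status == :scheduled

  'mirror updated'
end

db = ToyDatabase.new(:finished)
db.write(:scheduled) # step 3: the schedule worker marks the project as scheduled

puts update_mirror(db, read_from: :replica) # lagging replica still says finished
puts update_mirror(db, read_from: :primary) # what `always` consistency would read

db.replicate!
puts update_mirror(db, read_from: :replica) # replica has caught up
```

In this model, the first call reproduces the logged message, while the read from the primary and the read after replication both succeed.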

I think that RepositoryUpdateMirrorWorker somehow reads stale data from a replica that has not yet received the status change to scheduled (from step 3). It happens in about 1.2% of cases. We see this problem in the logs: 'Project was in an inconsistent state: finished'.

After we restored data consistency for RepositoryUpdateMirrorWorker to always, this problem almost disappeared.

A possible reason for this behavior is that we read data from a replica that is not up to date. However, this should not happen: with sticky data consistency, the worker is expected to read from the primary whenever the replica has not caught up. Related code: https://gitlab.com/gitlab-org/gitlab/blob/8c4b269470e817269375f0d972d7eb5aca13566d/lib/gitlab/database/load_balancing/sidekiq_server_middleware.rb#L52
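
To make the expectation concrete, here is a minimal sketch of the kind of check a sticky read is assumed to perform. It is not the middleware linked above: `choose_database` and its parameters are hypothetical, and WAL positions are modelled as plain integers rather than real PostgreSQL LSNs.

```ruby
# Hypothetical sketch of a "sticky" read decision; not GitLab's middleware.
# WAL positions are modelled as plain integers for simplicity.
def choose_database(job_wal_location:, replica_replayed_location:)
  # No recorded write position: we cannot prove the replica is fresh enough.
  return :primary if job_wal_location.nil?

  # The replica is safe only if it has replayed at least up to the position
  # the primary had reached when the job was scheduled.
  replica_replayed_location >= job_wal_location ? :replica : :primary
end

p choose_database(job_wal_location: 120, replica_replayed_location: 100) # replica lags
p choose_database(job_wal_location: 120, replica_replayed_location: 120) # replica caught up
p choose_database(job_wal_location: nil, replica_replayed_location: 120) # no position recorded
```

If a check like this works as intended, a job scheduled after step 3 should never see the finished status. The stale reads therefore suggest that either the recorded position predates the status change or the comparison is made against the wrong replica state.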

Edited by Vasilii Iakliushin