
`:until_executed` jobs with `reschedule_once` not rescheduled due to race condition

This was raised in https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/5372#note_1982555803

A race condition may cause Sidekiq jobs configured with `deduplicate :until_executed, if_deduplicated: :reschedule_once` to not be rescheduled.

Repeating the explanation in the internal issue:

We have two processes running: A (the Sidekiq server, reference this file) and B (the Sidekiq client, reference this method). Note that the Redis server is single-threaded, so all Redis commands are processed sequentially:

  1. B calls deduplicatable_job? && check! && duplicate_job.duplicate? and is found to be true
  2. A calls duplicate_job.should_reschedule? after yield-ing and gets false
  3. B calls duplicate_job.set_deduplicated_flag!
  4. A runs duplicate_job.delete!
  5. A does not run duplicate_job.reschedule since step 2 returned false.

Due to the ordering of steps 2 and 3, the deduplicated flag is read before it is set, even though a deduplication did occur.

The happy path would be:

  1. B calls deduplicatable_job? && check! && duplicate_job.duplicate? and is found to be true
  2. B calls duplicate_job.set_deduplicated_flag!
  3. A calls duplicate_job.should_reschedule? after yield-ing and gets true
  4. A runs duplicate_job.delete!
  5. A runs duplicate_job.reschedule since step 3 returned true.
  6. The job is picked up by a Sidekiq server and life goes on.
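
The two orderings above can be replayed with a small simulation. This is only a sketch: `FakeDuplicateJob` is a hypothetical in-memory stand-in for the Redis-backed state, not the real GitLab class, and `simulate` and the step names are invented for illustration. It borrows only the method names from the steps listed above.

```ruby
# Hypothetical in-memory stand-in for the Redis-backed duplicate job state.
# Only the method names mirror the steps described above.
class FakeDuplicateJob
  def initialize
    @key_exists = true     # A's running job already holds the idempotency key
    @deduplicated = false  # flag the client sets when it drops a duplicate
    @rescheduled = false
  end

  def duplicate?
    @key_exists
  end

  def set_deduplicated_flag!
    @deduplicated = true
  end

  def should_reschedule?
    @deduplicated
  end

  # Deleting the key also discards the flag stored alongside it.
  def delete!
    @key_exists = false
    @deduplicated = false
  end

  def reschedule
    @rescheduled = true
  end

  def rescheduled?
    @rescheduled
  end
end

# Runs the named steps one at a time, in the given order, the way a
# single-threaded Redis would serialise the underlying commands.
def simulate(order)
  job = FakeDuplicateJob.new
  should_reschedule = false

  steps = {
    b_check: -> { job.duplicate? },
    b_set_flag: -> { job.set_deduplicated_flag! },
    a_should_reschedule: -> { should_reschedule = job.should_reschedule? },
    a_delete: -> { job.delete! },
    a_reschedule: -> { job.reschedule if should_reschedule }
  }

  order.each { |name| steps.fetch(name).call }
  job.rescheduled?
end

RACY_ORDER  = %i[b_check a_should_reschedule b_set_flag a_delete a_reschedule]
HAPPY_ORDER = %i[b_check b_set_flag a_should_reschedule a_delete a_reschedule]

puts "racy:  #{simulate(RACY_ORDER)}"   # prints "racy:  false"
puts "happy: #{simulate(HAPPY_ORDER)}"  # prints "happy: true"
```

The only difference between the two runs is whether B sets the flag before or after A reads it. In the racy ordering the flag is written and then immediately discarded by `delete!`, so the duplicate is dropped and nothing is rescheduled.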

Proposed fix

The initial proposal was to synchronise deduplication checks using an exclusive lease. This did not work.

We fixed it with a lock-free approach in !159215 (merged)

Edited by Sylvester Chin