`:until_executed` jobs with `reschedule_once` not rescheduled due to race condition
This was raised in https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/5372#note_1982555803
A race condition may cause Sidekiq jobs configured with `deduplicate :until_executed, if_deduplicated: :reschedule_once` to not be rescheduled.
Repeating the explanation from the internal issue:
We have two processes running: A (the Sidekiq server, reference this file) and B (the Sidekiq client, reference this method). Note that the Redis server is single-threaded, so all Redis commands are processed sequentially:
1. B calls `deduplicatable_job? && check! && duplicate_job.duplicate?` and it is found to be true.
2. A calls `duplicate_job.should_reschedule?` after `yield`-ing and gets `false`.
3. B calls `duplicate_job.set_deduplicated_flag!`.
4. A runs `duplicate_job.delete!`.
5. A does not run `duplicate_job.reschedule` since step 2 returned `false`.
Because step 2 runs before step 3, the deduplicated flag is read before it is set, so the job is not rescheduled even though a deduplication occurred.
The happy path would be:
1. B calls `deduplicatable_job? && check! && duplicate_job.duplicate?` and it is found to be true.
2. B calls `duplicate_job.set_deduplicated_flag!`.
3. A calls `duplicate_job.should_reschedule?` after `yield`-ing and gets `true`.
4. A runs `duplicate_job.delete!`.
5. A runs `duplicate_job.reschedule` since step 3 returned `true`.
6. The job is picked up by a Sidekiq server and life goes on.
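The two orderings above can be replayed deterministically with a minimal in-memory sketch. This is not the GitLab implementation: `FakeDuplicateJob` and `run_interleaving` are hypothetical names invented for illustration, and a plain Hash stands in for Redis. Only the method names mirror the calls listed in the steps.

```ruby
# Hypothetical stand-in for the Redis-backed duplicate job state.
# A Hash replaces Redis; every call is applied sequentially, as in Redis.
class FakeDuplicateJob
  attr_reader :rescheduled

  def initialize
    # Assume a job with the same idempotency key is already being executed by A.
    @store = { existing: true }
    @rescheduled = false
  end

  def duplicate?
    @store.key?(:existing)
  end

  def set_deduplicated_flag!
    @store[:deduplicated] = true
  end

  def should_reschedule?
    @store[:deduplicated] == true
  end

  def delete!
    @store.clear
  end

  def reschedule
    @rescheduled = true
  end
end

# Replays the calls of A (server) and B (client) in one of the two orders.
def run_interleaving(flag_set_before_read:)
  job = FakeDuplicateJob.new

  is_duplicate = job.duplicate?                        # B: duplicate check
  set_flag = -> { job.set_deduplicated_flag! if is_duplicate }

  set_flag.call if flag_set_before_read                # B sets the flag in time (happy path)
  should_reschedule = job.should_reschedule?           # A reads the flag after yielding
  set_flag.call unless flag_set_before_read            # B sets the flag too late (race)

  job.delete!                                          # A deletes the duplicate state
  job.reschedule if should_reschedule                  # A reschedules only if the flag was seen
  job.rescheduled
end

puts "race:       rescheduled=#{run_interleaving(flag_set_before_read: false)}"
puts "happy path: rescheduled=#{run_interleaving(flag_set_before_read: true)}"
```

In this sketch the race ordering prints `rescheduled=false` and the happy path prints `rescheduled=true`. The only difference between the two runs is whether B's flag write lands before or after A's read.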
Proposed fix
The initial proposal was to synchronise on deduplication checks using an exclusive lease. This did not work.
We fixed it with a lock-free approach in !159215 (merged).