`:until_executed` jobs with `reschedule_once` not rescheduled due to race condition
This was raised in https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/5372#note_1982555803
A race condition may cause Sidekiq jobs configured with `deduplicate :until_executed, if_deduplicated: :reschedule_once` to not be rescheduled.
Repeating the explanation in the internal issue:
We have 2 processes running: A (the Sidekiq server, reference this file) and B (the Sidekiq client, reference this method). Note that the Redis server is single-threaded, so all Redis commands are processed sequentially:
1. B calls `deduplicatable_job? && check! && duplicate_job.duplicate?` and it is found to be true.
2. A calls `duplicate_job.should_reschedule?` after `yield`-ing and gets `false`.
3. B calls `duplicate_job.set_deduplicated_flag!`.
4. A runs `duplicate_job.delete!`.
5. A does not run `duplicate_job.reschedule` since step 2 returned `false`.
Due to the ordering of steps 2 and 3, the deduplicated flag is read before it is set, even though a deduplication did occur.
The happy path would be:
1. B calls `deduplicatable_job? && check! && duplicate_job.duplicate?` and it is found to be true.
2. B calls `duplicate_job.set_deduplicated_flag!`.
3. A calls `duplicate_job.should_reschedule?` after `yield`-ing and gets `true`.
4. A runs `duplicate_job.delete!`.
5. A runs `duplicate_job.reschedule` since step 3 returned `true`.
6. The job is picked up by a Sidekiq server and life goes on.
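
To make the difference between the two orderings concrete, here is a minimal sketch that replays both against an in-memory model. `FakeDuplicateJob`, `replay`, and the step names are hypothetical and exist only for illustration; a plain Hash stands in for Redis, and only the method names (`duplicate?`, `set_deduplicated_flag!`, `should_reschedule?`, `delete!`, `reschedule`) mirror the ones mentioned above. It is not the real middleware code.

```ruby
# Hypothetical in-memory stand-in for the Redis-backed duplicate job state.
# This is NOT GitLab's DuplicateJob class; only the method names mirror the issue.
class FakeDuplicateJob
  attr_reader :rescheduled

  def initialize
    # Assumption: process A is already running a job with this idempotency key.
    @store = { "idempotency_key" => "jid-of-running-job" }
    @rescheduled = false
  end

  # Client side (B): true when a job with the same key already exists.
  def duplicate?
    @store.key?("idempotency_key")
  end

  # Client side (B): records that a duplicate was dropped.
  def set_deduplicated_flag!
    @store["deduplicated"] = "1"
  end

  # Server side (A): reads the flag written by the client.
  def should_reschedule?
    @store["deduplicated"] == "1"
  end

  # Server side (A): clears all deduplication state for the key.
  def delete!
    @store.clear
  end

  # Server side (A): stands in for enqueueing the job again.
  def reschedule
    @rescheduled = true
  end
end

# Runs the named steps strictly one after another, the way a
# single-threaded Redis would serialise the underlying commands.
def replay(order)
  job = FakeDuplicateJob.new
  should_reschedule = nil
  steps = {
    b_check:      -> { job.duplicate? },
    b_set_flag:   -> { job.set_deduplicated_flag! },
    a_read_flag:  -> { should_reschedule = job.should_reschedule? },
    a_delete:     -> { job.delete! },
    a_reschedule: -> { job.reschedule if should_reschedule }
  }
  order.each { |name| steps.fetch(name).call }
  job.rescheduled
end

# A reads the flag before B sets it.
RACE_ORDER  = %i[b_check a_read_flag b_set_flag a_delete a_reschedule].freeze
# B sets the flag before A reads it.
HAPPY_ORDER = %i[b_check b_set_flag a_read_flag a_delete a_reschedule].freeze

puts "race ordering rescheduled?  #{replay(RACE_ORDER)}"
puts "happy ordering rescheduled? #{replay(HAPPY_ORDER)}"
```

In this model the only difference between the two runs is whether `a_read_flag` happens before or after `b_set_flag`. In the race ordering the flag written by B is wiped by `delete!` without ever being read, so the duplicate is dropped and nothing is rescheduled.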
Proposed fix
We can synchronise on deduplication checks using an exclusive lease. This did not work.
We fixed it with a lock-free approach in !159215 (merged)