`:until_executed` jobs with `reschedule_once` not rescheduled due to race condition

This was raised in https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/5372#note_1982555803

A race condition may cause Sidekiq jobs configured with `deduplicate :until_executed, if_deduplicated: :reschedule_once` to not be rescheduled.

Repeating the explanation in the internal issue:

We have two processes running: A (the Sidekiq server, reference this file) and B (the Sidekiq client, reference this method). Note that the Redis server is single-threaded, so all Redis commands are processed sequentially:

1. B calls `deduplicatable_job? && check! && duplicate_job.duplicate?`, which evaluates to true.
2. A calls `duplicate_job.should_reschedule?` after `yield`-ing and gets `false`.
3. B calls `duplicate_job.set_deduplicated_flag!`.
4. A runs `duplicate_job.delete!`.
5. A does not run `duplicate_job.reschedule` because step 2 returned `false`.

Because step 2 runs before step 3, the deduplicated flag is read before it is set, even though a deduplication occurred.

The happy path would be:

1. B calls `deduplicatable_job? && check! && duplicate_job.duplicate?`, which evaluates to true.
2. B calls `duplicate_job.set_deduplicated_flag!`.
3. A calls `duplicate_job.should_reschedule?` after `yield`-ing and gets `true`.
4. A runs `duplicate_job.delete!`.
5. A runs `duplicate_job.reschedule` because step 3 returned `true`.
6. The job is picked up by a Sidekiq server and life goes on.
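
The two orderings can be replayed with a small in-memory sketch. This is only an illustration: `FakeDuplicateJob`, `replay`, and the step names are hypothetical stand-ins that borrow the method names from the steps above. They are not the real `DuplicateJob` implementation and do not talk to Redis.

```ruby
# Hypothetical in-memory stand-in for the deduplication state kept in Redis.
# Only the method names mirror the steps described in this issue.
class FakeDuplicateJob
  attr_reader :rescheduled

  def initialize
    @exists = true         # the running job still holds its deduplication key
    @deduplicated = false  # set by the client when it drops a duplicate
    @rescheduled = false
  end

  def duplicate?
    @exists
  end

  def set_deduplicated_flag!
    @deduplicated = true
  end

  def should_reschedule?
    @deduplicated
  end

  def delete!
    @exists = false
    @deduplicated = false
  end

  def reschedule
    @rescheduled = true
  end
end

# Replays one interleaving of server (A) and client (B) steps sequentially,
# the way a single-threaded Redis would observe them.
def replay(order)
  job = FakeDuplicateJob.new
  should_reschedule = false
  steps = {
    b_check:      -> { job.duplicate? },
    b_flag:       -> { job.set_deduplicated_flag! },
    a_read:       -> { should_reschedule = job.should_reschedule? },
    a_delete:     -> { job.delete! },
    a_reschedule: -> { job.reschedule if should_reschedule }
  }
  order.each { |name| steps.fetch(name).call }
  job.rescheduled
end

RACE  = %i[b_check a_read b_flag a_delete a_reschedule]
HAPPY = %i[b_check b_flag a_read a_delete a_reschedule]

puts "race:  rescheduled=#{replay(RACE)}"   # prints "race:  rescheduled=false"
puts "happy: rescheduled=#{replay(HAPPY)}"  # prints "happy: rescheduled=true"
```

In the racy ordering, A reads the flag before B writes it, so the duplicate that B dropped is never rescheduled.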

Proposed fix

~~We can synchronise on deduplication checks using an exclusive lease.~~ This did not work.

We fixed it with a lock-free approach in !159215 (merged)

Edited Jul 25, 2024 by Sylvester Chin