
Introduce lock-free rescheduling for duplicate job

Sylvester Chin requested to merge sc1-multi-dedup-fix into master

What does this MR do and why?

This MR removes the use of dedup locks and introduces a signalling key that the until_executed server middleware uses to determine whether to reschedule jobs. This is a lock-free way to fix the race condition in rescheduling jobs for until_executed workers with the reschedule_once option.

By using a simple key instead of a msgpacked string, we can use MULTI to check and delete the signalling key transactionally. And since the deduplicated flag no longer lives in the msgpacked cookie, we can delete the cookie key before performing that transaction. This guarantees that the job is scheduled at least once (either by the server or the client middleware) during a data race.
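For illustration, a minimal sketch of what this flow could look like (check_reschedule_and_delete and set_deduplicated_flag! are the method names used in the diffs below; the deduplicated_flag_key name, the with_redis helper, and the TTL handling are assumptions of this sketch, not the actual implementation):

# Client middleware: when a duplicate is dropped, raise the signalling key
# with a plain SETNX instead of mutating the msgpacked cookie under a lock.
def set_deduplicated_flag!(expiry)
  with_redis { |redis| redis.set(deduplicated_flag_key, 1, nx: true, ex: expiry) }
end

# Server middleware: the cookie key has already been deleted via delete!
# before this runs. MULTI makes the read-and-clear of the signalling key
# atomic, so no lock is needed to decide whether to reschedule.
def check_reschedule_and_delete
  replies = with_redis do |redis|
    redis.multi do |multi|
      multi.get(deduplicated_flag_key) # did a client drop a duplicate?
      multi.del(deduplicated_flag_key) # clean up in the same transaction
    end
  end
  !replies[0].nil? # reschedule only if a duplicate was signalled
end

A client that signals before the MULTI runs is picked up by the GET; a client that checks after the cookie deletion simply schedules a new job. Either way the job runs at least once.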

This approach only modifies the server middleware's rescheduling logic, leaving the client middleware's deduplication logic completely unchanged. This helps prevent the broken deduplication behaviour (lock contention under high load) that caused gitlab-com/gl-infra/production#18264 (comment 1995096958).

Scalability considerations:

  • No exclusive lease usage: this approach adds no locks.
  • The Redis commands introduced are DEL and SETNX, both of which are O(1) and run very fast on the Redis server.
  • No change to the client-middleware deduplication behaviour.

Background of ideas: #471245 (comment 1996229546)

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.


How to set up and validate locally

Run with the feature flag enabled and cron disabled (gdk.test:3000/admin/sidekiq/cron). To be honest, I'm not entirely sure this is a good way to validate the protection against a data race in the rescheduling logic.
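For example, the flag introduced in this MR can be enabled from the rails console:

Feature.enable(:use_sidekiq_dedup_signaling)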

  1. Apply the diff below. The sleeps between dependent Redis commands help create more instances of data races, while the cache counters help keep track. While repeating the test, I noticed that the log does not pipe cleanly into a text file, so I opted for this approach instead, incrementing counters in the cache database:
Click to expand
diff --git a/lib/gitlab/sidekiq_logging/deduplication_logger.rb b/lib/gitlab/sidekiq_logging/deduplication_logger.rb
index 85d7da048377..527afd27eef3 100644
--- a/lib/gitlab/sidekiq_logging/deduplication_logger.rb
+++ b/lib/gitlab/sidekiq_logging/deduplication_logger.rb
@@ -15,6 +15,7 @@ def deduplicated_log(job, deduplication_type, deduplication_options = {})
         payload.merge!(
           deduplication_options.compact.transform_keys { |k| "deduplication.options.#{k}" })
 
+        Gitlab::Redis::Cache.with { |c| c.incr("expt:deduped") } if job['rescheduled_once']
         Sidekiq.logger.info payload
       end
 
@@ -30,6 +31,8 @@ def rescheduled_log(job)
         payload['job_status'] = 'rescheduled'
         payload['message'] = "#{base_message(payload)}: rescheduled"
 
+        Gitlab::Redis::Cache.with { |c| c.incr("expt:rescheduled") }
+
         Sidekiq.logger.info payload
       end
     end
diff --git a/lib/gitlab/sidekiq_logging/structured_logger.rb b/lib/gitlab/sidekiq_logging/structured_logger.rb
index 4754417639f1..cf1d31e6cf81 100644
--- a/lib/gitlab/sidekiq_logging/structured_logger.rb
+++ b/lib/gitlab/sidekiq_logging/structured_logger.rb
@@ -86,6 +86,8 @@ def log_job_done(job, started_time, payload, job_exception = nil)
                        'done'
                      end
 
+        Gitlab::Redis::Cache.with { |c| c.incr("expt:done") }
+
         payload['message'] = "#{message}: #{job_status}: #{payload['duration_s']} sec"
         payload['job_status'] = job_status
         payload['job_deferred_by'] = job['deferred_by'] if job['deferred']
diff --git a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb
index 0deb1686b9c9..1b36ce881cfd 100644
--- a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb
+++ b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb
@@ -35,6 +35,8 @@ def deduplicate?(job)
 
             return false unless check! && duplicate_job.duplicate?
 
+            sleep 0.01
+
             job['duplicate-of'] = duplicate_job.existing_jid
 
             return false unless duplicate_job.idempotent? # only dedup idempotent jobs
diff --git a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb
index fce677492565..3d0f9762827f 100644
--- a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb
+++ b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb
@@ -22,10 +22,12 @@ def perform(job)
             should_reschedule = if Feature.enabled?(:use_sidekiq_dedup_signaling, type: :beta) # rubocop:disable Gitlab/FeatureFlagWithoutActor -- global flags
                                   # use the deduplicating-signalling key
                                   duplicate_job.delete!
+                                  sleep 0.01
                                   duplicate_job.check_reschedule_and_delete
                                 else
                                   # use the deduplicated flag in cookie
                                   res = duplicate_job.should_reschedule?
+                                  sleep 0.01
                                   duplicate_job.delete!
                                   res
                                 end
  2. Run on the rails console:
[1] pry(main)> count = []
=> []
[2] pry(main)> threads = []
=> []
[3] pry(main)> 10.times {
[3] pry(main)*   threads << Thread.new {100.times {  count << PipelineProcessWorker.perform_async(123456) != nil } }
[3] pry(main)* }
=> 10
[4] pry(main)> threads.each(&:join)
=> [#<Thread:0x000000016ebfa828 (pry):4 dead>,
 #<Thread:0x000000016ebfa670 (pry):4 dead>,
 #<Thread:0x000000016ebfa530 (pry):4 dead>,
 #<Thread:0x000000016ebfa418 (pry):4 dead>,
 #<Thread:0x000000016ebfa2b0 (pry):4 dead>,
 #<Thread:0x000000016ebfa1c0 (pry):4 dead>,
 #<Thread:0x000000016ebfa0a8 (pry):4 dead>,
 #<Thread:0x000000016ebf9f18 (pry):4 dead>,
 #<Thread:0x000000016ebf9e28 (pry):4 dead>,
 #<Thread:0x000000016ebf9cc0 (pry):4 dead>]
[5] pry(main)> count.size
=> 1000
[6] pry(main)> count.compact.count
=> 78
  3. Check the 3 counter keys. They should add up according to the formula below (note that the exact numbers differ between runs). Here count is count.compact.count from the console session above, i.e. the number of calls that actually enqueued a job (returned a jid rather than nil).
redis /Users/sylvesterchin/work/gitlab-development-kit/redis/redis.socket[2]> mget expt:deduped expt:rescheduled expt:done
1) "75"
2) "99"
3) "102"
-----
"rescheduled" - "deduplicated" + count = "done"
99-75+78 = 102

Note: Rescheduled jobs could get deduplicated too. This can be seen in the logs.

  4. Repeating this with the feature flag disabled yields the same result, which shows that the dedup logic works as intended. Note: I've run this many times with varying amounts and random sleeps injected, and both approaches add up correctly. The race condition can only be replicated with intentionally timed sleeps (see below).

Race condition validation

We can re-use !158329 (merged) to show that the existing solution creates a race condition while this approach does not.

Click to show diff
diff --git a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb
index 0deb1686b9c9..aa7eb1276efe 100644
--- a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb
+++ b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb
@@ -39,6 +39,9 @@ def deduplicate?(job)

             return false unless duplicate_job.idempotent? # only dedup idempotent jobs

+            puts "SLEEP BEFORE SETTING DEDUP"
+            sleep 1
+
             duplicate_job.update_latest_wal_location!
             duplicate_job.set_deduplicated_flag!

diff --git a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb
index fce677492565..107d3bdaf3f5 100644
--- a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb
+++ b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb
@@ -19,17 +19,25 @@ def perform(job)
             # early return since not reschedulable. ensure block to handle cleanup.
             return unless duplicate_job.reschedulable?

+            puts "BEFORE CHECKING RESCHEDULE" # rubocop:disable Rails/Output -- debug
+            sleep 1
+
             should_reschedule = if Feature.enabled?(:use_sidekiq_dedup_signaling, type: :beta) # rubocop:disable Gitlab/FeatureFlagWithoutActor -- global flags
                                   # use the deduplicating-signalling key
                                   duplicate_job.delete!
+                                  puts "AFTER DELETING"
+                                  sleep 1
                                   duplicate_job.check_reschedule_and_delete
                                 else
                                   # use the deduplicated flag in cookie
                                   res = duplicate_job.should_reschedule?
+                                  puts "BEFORE DELETING"
+                                  sleep 1
                                   duplicate_job.delete!
                                   res
                                 end

+            puts "RESCHEDULE?: #{should_reschedule}"
             job_deleted = true
             duplicate_job.reschedule if should_reschedule
  1. Steps to reproduce the problem can be found in the past MR linked above.
  2. To verify fix, enable the feature flag
  3. Scenario 1: client does dedup check after server deletes cookie. Trigger 2 PipelineProcessWorker.perform_async(123456) jobs on the console. A cue would be to trigger the 2nd job after BEFORE CHECKING RESCHEDULE is printed and before AFTER DELETING (this shows up after the server acquires the exclusive lease).
  4. The client will schedule the job, since the server has deleted the cookie key and no deduplication happens:
# 1st job
2024-07-14_13:20:25.23358 rails-background-jobs : BEFORE CHECKING RESCHEDULE
2024-07-14_13:20:26.23981 rails-background-jobs : AFTER DELETING
...
2024-07-14_13:20:26.70534 rails-background-jobs : BEFORE CHECKING RESCHEDULE # 2nd job. this is fine since the first job has completed the worker logic
2024-07-14_13:20:27.24511 rails-background-jobs : RESCHEDULE?: false # 1st job
...
2024-07-14_13:20:27.71332 rails-background-jobs : AFTER DELETING # 2nd job
2024-07-14_13:20:28.71956 rails-background-jobs : RESCHEDULE?: false # 2nd job
  5. Scenario 2: client does dedup check before server deletes cookie. Run PipelineProcessWorker.perform_async(123456); PipelineProcessWorker.perform_async(123456) to quickly trigger 2 jobs on the console. Observe that the server reschedules the job.
# 1st job
2024-07-14_13:25:30.25530 rails-background-jobs : BEFORE CHECKING RESCHEDULE
2024-07-14_13:25:31.26005 rails-background-jobs : AFTER DELETING
2024-07-14_13:25:32.26398 rails-background-jobs : RESCHEDULE?: true
...
# 2nd job
2024-07-14_13:25:32.33674 rails-background-jobs : BEFORE CHECKING RESCHEDULE
2024-07-14_13:25:33.34506 rails-background-jobs : AFTER DELETING
2024-07-14_13:25:34.34886 rails-background-jobs : RESCHEDULE?: false
