Skip to content

Draft: Guard sidekiq job dedup logic with exclusive lease

Sylvester Chin requested to merge sc1-dedup-atomic into master

What does this MR do and why?

This MR adds the usage of exclusive lease around job deduplication logic. This ensures that we atomically perform the following logic per idempotency key

  1. checking if an idempotency key is duplicated
  2. updating the latest wal location
  3. setting the duplicate flag

We also atomically perform the check for rescheduling and job deletion in the :until_executed strategy.

See https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/5372#note_1982555803

TODO

  • add spec

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

Before After

How to set up and validate locally

Setup

  1. Run git revert 8c2c1ef6af79b0c076d0de5f9fd50cee963cbd92 to add the debug statements
  2. gdk restart rails. I'd suggest going to http://localhost:3000/admin/sidekiq/cron to disable crons to keep the tail-ed logs quieter.

Reproduce problem

  1. Open 2 terminals - 1 with gdk rails c and the other with gdk tail rails-background-jobs
  2. Trigger 2 PipelineProcessWorker.perform_async(123456) jobs on the console. Trigger 1 then wait for BEFORE DELETING to appear in the tailed logs and trigger the second

On the tailed logs, we should see the following, indicating that only no rescheduling is done

2024-07-04_09:13:45.26128 rails-background-jobs : BEFORE CHECKING RESCHEDULE
2024-07-04_09:13:46.27838 rails-background-jobs : BEFORE DELETING
2024-07-04_09:13:47.28044 rails-background-jobs : AFTER DELETING
2024-07-04_09:13:47.28055 rails-background-jobs : RESCHEDULE?: false

On console we should see that the 2nd job was deduplicated.

[1] pry(main)> PipelineProcessWorker.perform_async(123456)
=> "a08032dec019e6399098e1d3"
[2] pry(main)> PipelineProcessWorker.perform_async(123456)
SLEEP BEFORE SETTING DEDUP
=> nil

Verify fix

  1. Enable the feature flag Feature.enable(:use_sidekiq_dedup_lock)
  2. Restart with gdk restart rails-background-jobs
  3. Scenario 1: client acquires the lease first: Trigger 2 PipelineProcessWorker.perform_async(123456) jobs on the console. Trigger both jobs in quick succession. A cue would be to trigger the 2nd job after BEFORE CHECKING RESCHEDULE and before BEFORE DELETING (this shows up after the server acquires the excl lease).

On the server, we should see that the job gets rescheduled since the client dedups a job.

2024-07-04_09:19:22.12530 rails-background-jobs : BEFORE CHECKING RESCHEDULE
2024-07-04_09:19:24.20302 rails-background-jobs : BEFORE DELETING
2024-07-04_09:19:25.20895 rails-background-jobs : AFTER DELETING
2024-07-04_09:19:25.21041 rails-background-jobs : RESCHEDULE?: true 
....bunch of job logs
2024-07-04_09:19:25.27957 rails-background-jobs : BEFORE CHECKING RESCHEDULE
2024-07-04_09:19:26.29012 rails-background-jobs : BEFORE DELETING
2024-07-04_09:19:27.29261 rails-background-jobs : AFTER DELETING
2024-07-04_09:19:27.29277 rails-background-jobs : RESCHEDULE?: false

On console we should see that the 2nd job was deduplicated.

[4] pry(main)> PipelineProcessWorker.perform_async(123456)
=> "952e8b92a8764ec0e665ad77"
[5] pry(main)> PipelineProcessWorker.perform_async(123456)
SLEEP BEFORE SETTING DEDUP
=> nil
  1. Lets try the other scenario, what if the server acquires the lease first

On the server, we should see that the job does not get reschduled. Instead, another job is executed after the first.

2024-07-04_09:20:13.60808 rails-background-jobs : BEFORE CHECKING RESCHEDULE
2024-07-04_09:20:14.61395 rails-background-jobs : BEFORE DELETING
2024-07-04_09:20:15.62015 rails-background-jobs : AFTER DELETING
2024-07-04_09:20:15.62104 rails-background-jobs : RESCHEDULE?: false
... bunch of job logs
2024-07-04_09:20:16.06056 rails-background-jobs : BEFORE CHECKING RESCHEDULE
2024-07-04_09:20:17.06930 rails-background-jobs : BEFORE DELETING
2024-07-04_09:20:18.14626 rails-background-jobs : AFTER DELETING
2024-07-04_09:20:18.14641 rails-background-jobs : RESCHEDULE?: false

On console we should see that the 2nd job was not deduplicated. You'd notice a short wait when triggering the second job. That is the exclusive lease sleeping lock retrying.

[6] pry(main)> PipelineProcessWorker.perform_async(123456)
=> "8db959c88499386b3ac099f9"
[7] pry(main)> PipelineProcessWorker.perform_async(123456)
=> "4bc4a9f1064827dbf7d6aa8c"
Edited by Sylvester Chin

Merge request reports