
Move job deduplication out of the Sidekiq Redis and into a persistent Redis

From #867 (comment 510710523)

Some of our jobs depend on job deduplication to avoid race conditions when jobs would otherwise execute at the same time, specifically when using deduplicate :until_executed. Furthermore, when we use this to avoid doing "extra work", we also reduce load on other systems (some heavily deduplicated jobs are known to put a lot of pressure on the database). For this reason it would be good to migrate the Redis use for job deduplication out of redis-sidekiq and into a persistent Redis (at the time of writing, this would be SharedState).
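As a refresher, the :until_executed strategy can be sketched as follows. This is illustrative only, assuming a Redis-like store (a plain Hash stands in here so the example is self-contained); the names DuplicateJob, check!, and delete! mirror the concepts above, not GitLab's actual implementation, and TTL expiry is elided for brevity.

```ruby
# Minimal sketch of :until_executed deduplication against a Hash
# standing in for Redis. A real implementation would use SET NX EX.
class DuplicateJob
  DEFAULT_TTL = 6 * 60 * 60 # 6 hours, the default duplicate job key TTL

  def initialize(store, idempotency_key, ttl: DEFAULT_TTL)
    @store = store
    @key = idempotency_key
    @ttl = ttl
  end

  # True if this job is the first with its idempotency key (SET NX
  # semantics); duplicates see false and can be dropped.
  def check!
    return false if @store.key?(@key)

    @store[@key] = { expires_at: Time.now + @ttl }
    true
  end

  # Called once the job has executed, so new jobs can be scheduled again.
  def delete!
    @store.delete(@key)
  end
end

store = {}
first = DuplicateJob.new(store, 'project_destroy:42')
dup   = DuplicateJob.new(store, 'project_destroy:42')

first.check! # => true, the job is scheduled
dup.check!   # => false, the duplicate is dropped
first.delete!
dup.check!   # => true again after execution completes
```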

Because of the including_scheduled option that workers can specify, we'll also need to migrate the deduplication information for jobs scheduled in the future.
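A backfill for those scheduled-job keys could look roughly like this. This is a hypothetical sketch with Hash-like stores and an assumed key prefix; in practice it would be a SCAN over the duplicate job key namespace, preserving each key's TTL.

```ruby
# Hypothetical one-off backfill: copy duplicate job keys for jobs
# scheduled in the future from the old store to the new one, without
# overwriting keys already written by the dual-write phase.
# The prefix is illustrative, not the actual namespace.
def migrate_scheduled_keys(old_store, new_store, prefix: 'duplicate:')
  old_store.each do |key, value|
    next unless key.start_with?(prefix)

    new_store[key] = value unless new_store.key?(key)
  end
end
```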

This could be done in the following steps, each deployed separately:

  1. Initial state: reading the idempotency_key from redis-sidekiq.
  2. Start writing the key to redis-cluster-sidekiq-metadata during the check! transaction, while still reading from redis-sidekiq. delete! removes the key from both Redis instances. This is done via a MultiStore feature flag toggle.
  3. Wait 6 hours before switching reads. 6 hours is the default duplicate job key TTL, which is longer than all user-specified TTLs (see below).
  4. Stop reading from redis-sidekiq and use only redis-cluster-sidekiq-metadata.
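The dual-write phase in the steps above can be sketched as follows. This assumes Hash-like stores and illustrative names (MultiStore, read_from_new); the real MultiStore consults a feature flag rather than a constructor argument.

```ruby
# Sketch of the MultiStore pattern: write to both Redis instances,
# read from whichever one the feature flag selects, delete from both.
class MultiStore
  def initialize(old_store, new_store, read_from_new: false)
    @old = old_store # redis-sidekiq
    @new = new_store # redis-cluster-sidekiq-metadata
    @read_from_new = read_from_new # a feature-flag check in practice
  end

  # Step 2: dual-write so the new instance warms up with live keys.
  def set(key, value)
    @old[key] = value
    @new[key] = value
  end

  # Steps 1 and 4: reads follow the flag; flipping it switches stores.
  def get(key)
    @read_from_new ? @new[key] : @old[key]
  end

  # delete! must clear the key from both instances during the transition.
  def delete(key)
    @old.delete(key)
    @new.delete(key)
  end
end
```

Because writes and deletes always hit both stores, the read switch in step 4 is safe once every key written before the dual-write phase has expired, hence the 6-hour wait in step 3.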
Specified TTLs of deduplicated workers:

```shell
➜  gitlab git:(master) ✗ rg deduplicate app | rg ttl
app/workers/project_destroy_worker.rb:  deduplicate :until_executed, ttl: 2.hours
app/workers/pipeline_process_worker.rb:  deduplicate :until_executed, if_deduplicated: :reschedule_once, ttl: 1.minute
app/workers/group_destroy_worker.rb:  deduplicate :until_executed, ttl: 2.hours
app/workers/namespaces/process_sync_events_worker.rb:    deduplicate :until_executed, if_deduplicated: :reschedule_once, ttl: 1.minute
app/workers/merge_requests/cleanup_ref_worker.rb:    deduplicate :until_executed, if_deduplicated: :reschedule_once, ttl: 1.minute
app/workers/ci/pipeline_cleanup_ref_worker.rb:    deduplicate :until_executed, if_deduplicated: :reschedule_once, ttl: 1.minute
app/workers/projects/process_sync_events_worker.rb:    deduplicate :until_executed, if_deduplicated: :reschedule_once, ttl: 1.minute
app/workers/container_registry/migration/guard_worker.rb:      deduplicate :until_executed, ttl: 5.minutes
app/workers/container_registry/migration/enqueuer_worker.rb:      deduplicate :until_executing, ttl: DEFAULT_LEASE_TIMEOUT
➜  gitlab git:(master) ✗ rg deduplicate ee/app | rg ttl
ee/app/workers/project_import_schedule_worker.rb:  deduplicate :until_executed, ttl: 5.minutes, if_deduplicated: :reschedule_once
ee/app/workers/work_items/update_parent_objectives_progress_worker.rb:    deduplicate :until_executing, ttl: 5.minutes
ee/app/workers/geo/metrics_update_worker.rb:    deduplicate :until_executed, ttl: 20.minutes
```

Verify that our deduplication of Sidekiq jobs remains an optimization and not a hard requirement for correct operation of the entire system. If it turns out that some cases require it, we will need to find a way to achieve this in a multi-cluster Sidekiq deployment, or those cases will need to be adjusted.

Edited by Sylvester Chin