Move job deduplication out of the sidekiq redis and into a persistent redis
Some of our jobs depend on job deduplication to avoid race conditions when jobs would otherwise be executed at the same time, specifically when using `deduplicate :until_executed`. Furthermore, where we use deduplication to avoid doing "extra work", we also reduce load on other systems (some heavily deduplicated jobs are known to put a lot of pressure on the database). For these reasons it would be good to migrate the Redis use for job deduplication out of redis-sidekiq and into a persistent Redis (at the time of writing, this would be `SharedState`).
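As a rough mental model (this is a simplified sketch, not the actual GitLab duplicate-jobs middleware; the store, key shape, and helper names are illustrative), `deduplicate :until_executed` can be pictured as claiming an idempotency key on enqueue and releasing it only after the job has run:

```ruby
require "securerandom"

# Hypothetical, simplified stand-in for the duplicate-job Redis keys.
class FakeDedupStore
  def initialize
    @store = {}
  end

  # SET-NX-style claim: succeeds only if the key is not already held.
  def claim(key, value)
    return false if @store.key?(key)

    @store[key] = value
    true
  end

  def delete(key)
    @store.delete(key)
  end
end

# Enqueue drops the job if another job with the same key is outstanding.
def enqueue(store, worker_class, args)
  key = "duplicate:#{worker_class}:#{args.join(',')}" # illustrative key shape
  store.claim(key, SecureRandom.uuid) ? :enqueued : :dropped_as_duplicate
end

# With :until_executed, the key is deleted only after the job has run.
def perform(store, worker_class, args)
  # ... the actual work happens here ...
  store.delete("duplicate:#{worker_class}:#{args.join(',')}")
end

store = FakeDedupStore.new
p enqueue(store, "PipelineProcessWorker", [42]) # => :enqueued
p enqueue(store, "PipelineProcessWorker", [42]) # => :dropped_as_duplicate
perform(store, "PipelineProcessWorker", [42])
p enqueue(store, "PipelineProcessWorker", [42]) # => :enqueued
```

This is also why losing the keys mid-migration matters less than it first appears: a lost key only allows one extra duplicate execution, which is why the section below asks us to verify deduplication remains an optimization.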
Because of the `including_scheduled` option that workers can specify, we'll also need to migrate the deduplication information for jobs scheduled in the future.
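To illustrate why scheduled jobs matter here, a toy sketch (assumed semantics, not the real implementation): a duplicate-job key for a job scheduled via `perform_in` has to live at least until the job runs, so such keys can be long-lived and must be carried over to the new Redis:

```ruby
# Toy model: the duplicate-job key for a scheduled job has to survive
# at least until the job executes.
fake_redis = {}

DEFAULT_TTL = 6 * 60 * 60 # 6 hours, the default duplicate job key TTL

def schedule(fake_redis, worker, args, delay_seconds)
  key = "duplicate:#{worker}:#{args.hash}" # illustrative key shape
  return :deduplicated if fake_redis.key?(key)

  # The key must outlive the scheduled delay; during a migration, any
  # such long-lived keys also have to exist in the new Redis.
  fake_redis[key] = Time.now + [DEFAULT_TTL, delay_seconds].max
  :scheduled
end

p schedule(fake_redis, "SomeWorker", [1], 24 * 60 * 60) # => :scheduled
p schedule(fake_redis, "SomeWorker", [1], 24 * 60 * 60) # => :deduplicated
```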
This could be done in the following steps, each deployed separately:
- Initial state: read the `idempotency_key` from redis-sidekiq.
- Start writing the key to redis-cluster-sidekiq-metadata during the `check!` transaction, while still reading from redis-sidekiq. `delete!` removes the key from both Redis instances. This is toggled via a MultiStore feature flag.
- Wait 6 hours before switching reads. 6 hours is the default duplicate job key TTL, which is longer than all user-specified TTLs (see below).
- Stop reading from redis-sidekiq and use only redis-cluster-sidekiq-metadata.
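The dual-write phase above can be sketched as follows. `MultiStoreDedup` and its method names are illustrative and are not the real `Gitlab::Redis::MultiStore` API; plain hashes stand in for the two Redis instances:

```ruby
# Illustrative dual-write wrapper for the migration steps above.
class MultiStoreDedup
  def initialize(old_store, new_store, read_from_new: false)
    @old = old_store
    @new = new_store
    # In the real migration this toggle is a feature flag.
    @read_from_new = read_from_new
  end

  # check!: read from one store, but write the key to both.
  def check!(key, jid)
    return read_store[key] if read_store.key?(key) # duplicate found

    @old[key] = jid
    @new[key] = jid
    nil
  end

  # delete!: the key is removed from both Redis instances.
  def delete!(key)
    @old.delete(key)
    @new.delete(key)
  end

  private

  def read_store
    @read_from_new ? @new : @old
  end
end

old_redis = {}
new_redis = {}
store = MultiStoreDedup.new(old_redis, new_redis)

store.check!("duplicate:worker:1", "jid-1")
# Both stores now hold the key, so reads can be flipped once the
# 6-hour TTL window has passed.
p old_redis.key?("duplicate:worker:1") # => true
p new_redis.key?("duplicate:worker:1") # => true
store.delete!("duplicate:worker:1")
p old_redis.empty? && new_redis.empty? # => true
```

Because every key written during the dual-write phase exists in both stores, waiting out the longest TTL before flipping `read_from_new` guarantees no key is read from the old store only.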
<details><summary>Click to show specified TTLs of workers</summary>

```shell
➜ gitlab git:(master) ✗ rg deduplicate app | rg ttl
app/workers/project_destroy_worker.rb:    deduplicate :until_executed, ttl: 2.hours
app/workers/pipeline_process_worker.rb:    deduplicate :until_executed, if_deduplicated: :reschedule_once, ttl: 1.minute
app/workers/group_destroy_worker.rb:    deduplicate :until_executed, ttl: 2.hours
app/workers/namespaces/process_sync_events_worker.rb:    deduplicate :until_executed, if_deduplicated: :reschedule_once, ttl: 1.minute
app/workers/merge_requests/cleanup_ref_worker.rb:    deduplicate :until_executed, if_deduplicated: :reschedule_once, ttl: 1.minute
app/workers/ci/pipeline_cleanup_ref_worker.rb:    deduplicate :until_executed, if_deduplicated: :reschedule_once, ttl: 1.minute
app/workers/projects/process_sync_events_worker.rb:    deduplicate :until_executed, if_deduplicated: :reschedule_once, ttl: 1.minute
app/workers/container_registry/migration/guard_worker.rb:    deduplicate :until_executed, ttl: 5.minutes
app/workers/container_registry/migration/enqueuer_worker.rb:    deduplicate :until_executing, ttl: DEFAULT_LEASE_TIMEOUT
➜ gitlab git:(master) ✗ rg deduplicate ee/app | rg ttl
ee/app/workers/project_import_schedule_worker.rb:    deduplicate :until_executed, ttl: 5.minutes, if_deduplicated: :reschedule_once
ee/app/workers/work_items/update_parent_objectives_progress_worker.rb:    deduplicate :until_executing, ttl: 5.minutes
ee/app/workers/geo/metrics_update_worker.rb:    deduplicate :until_executed, ttl: 20.minutes
```

</details>
Verify that our deduplication of Sidekiq jobs remains an optimization and not a hard requirement for correct operation of the entire system. If it turns out that some cases require it, then we will need to find a way to achieve this in a multi-cluster Sidekiq deployment, or those cases will need to be adjusted.