
Sidekiq spawns multiple instances of the package metadata sync job

Summary

When executing via Sidekiq, it is possible for several instances of PackageMetadata::SyncWorker to be running at the same time.

This appears to be related to a Sidekiq shutdown/restart stopping job execution without removing the job's ID from the queue. When Sidekiq is restarted, the job is requeued and started again. But another job will also be kicked off via cron, and the deduplicate option is ignored: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/workers/package_metadata/sync_worker.rb#L12
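For reference, the relevant worker declaration looks roughly like the sketch below (from memory; the linked source is authoritative and the exact deduplication options may differ):

```ruby
# ee/app/workers/package_metadata/sync_worker.rb (sketch; see the link above
# for the actual source -- the deduplication strategy here is an assumption).
module PackageMetadata
  class SyncWorker
    include ApplicationWorker

    data_consistency :sticky      # assumption
    deduplicate :until_executing  # bypassed when an interrupted job is requeued
    idempotent!

    def perform
      # sync package metadata ...
    end
  end
end
```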

Though this has no effect on data integrity, it does affect query performance: in certain cases multiple running jobs cause up to a 10x slowdown of bulk inserts.

Steps to reproduce

This is a semi-reliable way to reproduce the issue.

  1. Modify the cron job to run every minute (see the sketch after this list): https://gitlab.com/gitlab-org/gitlab/-/blob/4a4680538c36f988983e590b91342dd986d877c2/config/initializers/1_settings.rb#L835
  2. Run Sidekiq locally (bundle exec sidekiq).
  3. Wait for the cron job to start (sidekiq.log will show a start job_status for PackageMetadata::SyncWorker).
  4. Once the job is running, interrupt it with a kill command.
  5. Restart Sidekiq; you should see two starts (job_status: start) for PackageMetadata::SyncWorker with different job IDs.
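For step 1, the change is just the cron expression in the initializer. A sketch, assuming the settings key is named after the worker (check the linked line for the actual entry):

```ruby
# config/initializers/1_settings.rb (sketch; the key name is an assumption)
Settings.cron_jobs['package_metadata_sync_worker'] ||= {}
Settings.cron_jobs['package_metadata_sync_worker']['cron'] = '* * * * *' # every minute
Settings.cron_jobs['package_metadata_sync_worker']['job_class'] = 'PackageMetadata::SyncWorker'
```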

What is the current bug behavior?

Several jobs executing PackageMetadata::SyncWorker run concurrently.

What is the expected correct behavior?

Only a single instance of PackageMetadata::SyncWorker should run.

Relevant logs and/or screenshots

This is a slice of the log showing what happens when jobs are interrupted.

  • job with id db2c023935492790ed4cfccc starts at 19:00 (UTC)
  • it is interrupted twice (at 19:01 and 19:23), as shown by the interrupted_count column
  • finally, it is interrupted again somewhere between 19:23 and 19:47 and is added to the interrupted queue
  • after this, cron starts another job, 92654433867411977c08fe69, at 20:00, which is also interrupted twice
  • cron starts yet another job, bb5248fcfe0100295a48820c, at 21:00

Additionally, the 2nd entry in the linked log slice shows a side effect of multiple SyncWorker jobs running at the same time: the possibility of deadlocks.

Jobs can be correlated with an interrupt by checking the logs for the node associated with the job. For example, looking at the json.tag of db2c023935492790ed4cfccc and then checking all the logs for that tag, we can see that a Sidekiq shutdown initiated that interrupt.

Possible fixes

  1. Investigate options for making deduplication also apply in the interrupt/requeue case.
  2. Ensure that PackageMetadata::SyncWorker uses its own lock to determine whether a job is already executing.
  3. Reduce the chunk size so that more checkpoints are taken, progress is made more quickly, and fewer interrupts catch the job mid-chunk.

For options 1 and 2 there seem to be quite a few edge cases to think about. Preventing an interrupt but having the job fail means that no new job starts until the deduplication key for the cron job is removed, and there is no straightforward way to remove a dead deduplication key; that approach has edge cases of its own.
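As an illustration of option 2, the worker could guard perform with Gitlab::ExclusiveLease, the existing locking utility in the codebase; the lease key and TTL below are illustrative. Note the same edge case applies: if a job dies without cancelling the lease, no new job can start until the TTL expires.

```ruby
module PackageMetadata
  class SyncWorker
    include ApplicationWorker

    LEASE_KEY = 'package_metadata:sync_worker' # illustrative key
    LEASE_TTL = 5.minutes                      # illustrative TTL

    def perform
      lease = Gitlab::ExclusiveLease.new(LEASE_KEY, timeout: LEASE_TTL.to_i)
      uuid = lease.try_obtain
      return unless uuid # another instance already holds the lock

      begin
        sync! # the actual sync work
      ensure
        Gitlab::ExclusiveLease.cancel(LEASE_KEY, uuid)
      end
    end
  end
end
```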

Option 3 seems to be the most feasible and straightforward fix. Reducing the chunk size allows the sync to make more frequent progress and lets the worker check whether its runtime is still within a fixed TTL. It also avoids the case where the chunk size is so large that no checkpoint is created before the job is killed and no progress is made at all. A higher-frequency job interval matching the approximate duration doesn't need elaborate locking mechanisms or accounting for interrupts.

Reducing chunk size

A test in a local environment returned the following numbers for different chunk sizes (durations in seconds):

| Chunk size | Full ingestion duration (s) | Avg time per chunk (s) |
|------------|----------------------------:|-----------------------:|
| 10k lines (~260KB per chunk) | 875.7 | 3.3 |
| 50k lines (~1.5MB per chunk) | 919.7 | 17.3 |
| 100k lines (~2.5MB per chunk) | 880.6 | 32.6 |
| 500k lines (~13MB per chunk) | 880.5 | 146.7 |
| 1m lines (~26MB per chunk) | 893.1 | 297.6 |
| 1.5m lines (~39MB per chunk) | 863.3 | 431.5 |

A 5 minute TTL is recommended for workers in the low urgency queue. Accounting for differences in resourcing, 50k lines seems the safer bet: at ~17s per chunk, even a 10x increase in duration would still allow the service to checkpoint before 5 minutes elapse.
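Under this approach the ingestion loop would checkpoint after every chunk and stop cleanly when it approaches its time budget, leaving the remainder to the next cron run. A minimal sketch, with illustrative class, method, and constant names (not the actual ingestion code):

```ruby
module PackageMetadata
  class SyncService
    CHUNK_SIZE  = 50_000  # lines per chunk (~1.5MB)
    MAX_RUNTIME = 4 * 60  # seconds; stay safely under the 5 minute TTL

    def execute
      started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)

      each_chunk(CHUNK_SIZE) do |chunk|
        ingest(chunk)
        save_checkpoint!(chunk) # progress survives an interrupt

        elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
        break if elapsed > MAX_RUNTIME # next cron run resumes from the checkpoint
      end
    end
  end
end
```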

Implementation plan
