Skip to content

Update package_metadata sync concurrency

What does this MR do and why?

PackageMetadata::SyncWorker synchronizes csv data from a gcp bucket or file against the GitLab instance's database using bulk insert. The job is long-running (especially on initial import) which causes multiple instances of the worker to run at the same time. The bug affects performance by causing a significant slowdown in bulk upsert query speed. 10x slow-downs are common for when 2 or more workers are doing bulk inserts concurrently. Deadlocks are less frequent but do occur causing the job to fail.

In the normal flow of events the job is started by Sidekiq::Cron at the next scheduled interval. Multiple jobs are prevented from starting by the Deduplication Middleware.

The error state happens when sidekiq shuts down due to a node restart or a re-deploy (this is frequent). When this happens, the running job is terminated but pushed back to the default queue to allow another sidekiq worker to pick up the job. Now there are two triggers for the job. (1) A normal start involving Sidekiq::Cron and (2) a start by another node's worker picking up the job from the default queue.

This is a rough visualization of what happens.

sequenceDiagram
    participant Redis
    participant Cron as Sidekiq::Cron
    participant DedupeMiddleware
    participant Sidekiq
    participant Sync as SyncWorker1
    participant Sync2 as SyncWorker2
    activate Redis
    activate Sidekiq
    activate Cron

    Note right of Cron: (1) cron tries to schedule job
    Cron->>DedupeMiddleware: is_duplicate?(sync_job)
    DedupeMiddleware-->>Cron: false
    Cron->>Sidekiq: schedule sync_job
    Sidekiq->>Sync: start
    activate Sync
    Sync->>Sync: work

    Note right of Cron: (2) cron tries to schedule job
    Cron->>DedupeMiddleware: is_duplicate?(sync_job)
    DedupeMiddleware-->>Cron: true
    Cron->>Cron: noop

    Note right of Sidekiq: (3) sidekiq node shuts down
    Sidekiq--xSync: shutdown
    deactivate Sync
    Sidekiq->>Redis: add sync_job to default queue
    deactivate Cron
    deactivate Sidekiq
    
    Note right of Sidekiq: (4) sidekiq node restarts
    activate Sidekiq
    activate Cron
    
    Sidekiq->>Redis: get default queue jobs
    Redis-->>Sidekiq: [..., sync_job, ...]
    Sidekiq->>Sync: start
    activate Sync
    Sync->>Sync: work

    Note right of Cron: (5) cron tries to schedule job
    Cron->>DedupeMiddleware: is_duplicate?(sync_job)
    DedupeMiddleware-->>Cron: false
    Cron->>Sidekiq: schedule sync_job
    Sidekiq->>Sync2: start
    activate Sync2
    Sync2->>Sync2: work

    deactivate Sync
    deactivate Sync2
    deactivate Cron
    deactivate Sidekiq
    deactivate Redis

The changes in this MR

This MR changes the sync job to take a global lock (via Gitlab:ExclusiveLease) so that only a single PackageMetadata::SyncWorker job may run at any given time.

Alongside the global lock, the job run time is reduced to 5 minutes. This makes job execution more predictable, reduces the chance of the job being interrupted by a sidekiq restart, and abides by the sidekiq development guidelines regarding a 5 minute maximum: https://docs.gitlab.com/ee/development/sidekiq/worker_attributes.html

The job also self-preempts based on the global lock's ttl. After every checkpoint PackageMetadata::SyncService checks whether it's out of time. If it is, the job stops.

The deduplication strategy is also removed to ensure that ExclusiveLease is the SSOT for uniquely identifying the job.

How to set up and validate locally

Before this fix

  1. git checkout master
  2. start sidekiq and tail the log for sync worker messages (bundle exec sidekiq | tee -a /tmp/sidekiq.log, `tail -f /tmp/sidekiq.log| jq 'select(.class == "PackageMetadata::SyncWorker") | .' )
  3. wait for cron to start the job (can bump frequency to every minute)
  4. terminate sidekiq ps | grep sidekiq, kill -TERM [PID] (there will be a delay while sidekiq tries to shutdown gracefully)
  5. start sidekiq again
  6. verify 2 or more PackageMetadata::SyncWorker job_status: start messages (without corresponding job_status: done messages almost immediately after) in the logs

After the fix

  1. git checkout 393433-reduce-chance-of-concurrent-sync-jobs
  2. start sidekiq and tail the log for sync worker messages (bundle exec sidekiq | tee -a /tmp/sidekiq.log, `tail -f /tmp/sidekiq.log| jq 'select(.class == "PackageMetadata::SyncWorker") | .' )
  3. wait for cron to start the job (can bump frequency to every minute)
  4. terminate sidekiq ps | grep sidekiq, kill -TERM [PID]
  5. start sidekiq again
  6. verify only 1 PackageMetadata::SyncWorker job_status: start message without a corresponding done; other start messages are followed by a deduplication message

Note, to make it more predictable change the code of SyncService#execute to a simple loop with a log message and sleep. Example:

class SyncService
  def execute
    count = 0
    loop do
      count += 1
      Gitlab::PackageMetadata::Logger.debug(class: self.class.name, message: "In execute loop (iteration: #{count}")
      sleep(2)
    end
  end
end 

Gitlab::PackageMetadata::Logger debug messages will appear under $GITLAB_DIR/log/package_metadata.log

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #393433 (closed)

Edited by Igor Frenkel

Merge request reports