Defer sidekiq jobs based on database health status indicators

What does this MR do and why?

This is part III of #404898 (closed). !121261 (merged) exposed the variables needed to define the health check context. This MR performs the database health check and defers (re-queues) the Sidekiq job if the database signals to stop.

Approach:

We extend the existing Sidekiq DeferJobs middleware to perform a check on database health status as well.
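To illustrate the idea, here is a minimal, self-contained sketch of a middleware that re-queues a job instead of running it when a health check says to stop. It is not the actual DeferJobs implementation: `DeferOnHealthSignalMiddleware`, `FakeSleepWorker`, `defer_delay` and the `health_check` callable are hypothetical names made up for this example, and `perform_in` is faked here so the snippet runs without Sidekiq. Only the `call(worker, job, queue)` shape mirrors a Sidekiq server middleware.

```ruby
# Hypothetical sketch; none of these classes exist in GitLab under these names.
class DeferOnHealthSignalMiddleware
  # health_check: callable that returns true when the database signals "stop".
  def initialize(health_check)
    @health_check = health_check
  end

  # Same shape as a Sidekiq server middleware: call(worker, job, queue) plus a block.
  def call(worker, job, _queue)
    klass = worker.class

    if klass.respond_to?(:defer_delay) && @health_check.call(klass)
      # Re-queue the job for later; the block (the job body) is not executed.
      klass.perform_in(klass.defer_delay, *job['args'])
      return
    end

    yield
  end
end

# Stand-in worker that records re-queue requests instead of talking to Redis.
class FakeSleepWorker
  class << self
    def scheduled
      @scheduled ||= []
    end

    # Assumed delay in seconds (1 minute, as in the worker example in this MR).
    def defer_delay
      60
    end

    def perform_in(delay, *args)
      scheduled << { delay: delay, args: args }
    end
  end
end

executed = false
middleware = DeferOnHealthSignalMiddleware.new(->(_klass) { true })
middleware.call(FakeSleepWorker.new, { 'args' => [1] }, 'default') { executed = true }

puts "executed: #{executed}, re-queued: #{FakeSleepWorker.scheduled.size}"
```

With the health check forced to `true`, the job body is skipped and one re-queue request is recorded. With a check that returns `false`, the block runs as usual.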

Feature flag:

This feature is behind a feature flag (defer_sidekiq_workers_on_database_health_signal). Rollout issue: #412990 (closed)


P.S.: The initial work was done in MR#119187, which contained the entire changeset and became huge, so it was split into smaller MRs to keep more control.

How to set up and validate locally

  1. Enable the 'defer_sidekiq_workers_on_database_health_signal' FF from the Rails console: Feature.enable(:defer_sidekiq_workers_on_database_health_signal).
  2. Create a new worker or choose an existing one. This example uses 'Chaos::SleepWorker'.
  3. Set 'database_health_check_attrs' for the worker, e.g.:
     # frozen_string_literal: true
    
     module Chaos
       class SleepWorker # rubocop:disable Scalability/IdempotentWorker
         ...
         ...
         defer_on_database_health_signal :gitlab_main, 1.minute, [:users]
    
         def perform(duration_s)
           Gitlab::Chaos.sleep(duration_s)
         end
       end
     end
  4. Chaos::SleepWorker.defer_on_database_health_signal? should now return true.
  5. Make defer_job_by_database_health_signal? return true locally (so that we don't have to set up an actual database health status evaluation).
  6. Run reload! if you are reusing the same Rails console.
  7. Tail the job logs: gdk tail rails-background-jobs | grep Chaos::SleepWorker
    • If needed, restart the service locally: gdk restart rails-background-jobs
  8. Performing the job should schedule it a minute later instead of executing it immediately.
    pry(main)> Chaos::SleepWorker.perform_async(1)
    => "9e402fc977d7bcf32597fe91"
    
    pry(main)> queue = Sidekiq::ScheduledSet.new
    pry(main)> queue.map { |job| job }
    => [#<Sidekiq::SortedEntry:0x00000001495b7788
     @args=nil,
     @item=
     {"retry"=>3,
       "queue"=>"default",
       "backtrace"=>true,
       "version"=>0,
       "queue_namespace"=>"chaos",
       "class"=>"Chaos::SleepWorker",
       "args"=>[1],
       "jid"=>"d51303502cb5f6849488961b",
       "created_at"=>1685034663.062152,
       "correlation_id"=>"b8b450db286639352dd5195d6c85ce13",
       "meta.caller_id"=>"Chaos::SleepWorker",
       "meta.feature_category"=>"not_owned",
       "meta.root_caller_id"=>"Chaos::SleepWorker",
       "worker_data_consistency"=>"always",
       "size_limiter"=>"validated",
       "scheduled_at"=>1685034723.062093},
     @parent=#<Sidekiq::ScheduledSet:0x000000014950ccc0 @_size=1, @name="schedule">,
     @queue="default",
     @score=1685034723.062093,
     @value=
     "{\"retry\":3,\"queue\":\"default\",\"backtrace\":true,\"version\":0,\"queue_namespace\":\"chaos\",\"class\":\"Chaos::SleepWorker\",\"args\":[1],\"jid\":\"d51303502cb5f6849488961b\",\"created_at\":1685034663.062152,\"correlation_id\":\"b8b450db286639352dd5195d6c85ce13\",\"meta.caller_id\":\"Chaos::SleepWorker\",\"meta.feature_category\":\"not_owned\",\"meta.root_caller_id\":\"Chaos::SleepWorker\",\"worker_data_consistency\":\"always\",\"size_limiter\":\"validated\",\"scheduled_at\":1685034723.062093}">]
  9. Note the 'jid' for later, and check that 'scheduled_at' is one minute after 'created_at'.
  10. After a minute, logs should appear in the tail from step 7 (rails-background-jobs).
  11. When the job executes, it is re-queued again (because of step 5), but with a different 'jid' than the previous one.
    pry(main)> queue = Sidekiq::ScheduledSet.new
    pry(main)> queue.map { |job| job }
    # We should see another job scheduled a minute after the previous job's execution, with a new 'jid'
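
The check in step 9 can be done with a little arithmetic on the payload. The sketch below is plain Ruby and needs no Sidekiq. The hash copies the 'jid', 'created_at' and 'scheduled_at' values from the scheduled entry shown in step 8, and `deferral_delay` is a hypothetical helper introduced only for this example.

```ruby
# Values copied from the scheduled job payload shown in step 8.
job = {
  'jid' => 'd51303502cb5f6849488961b',
  'created_at' => 1685034663.062152,
  'scheduled_at' => 1685034723.062093
}

# Hypothetical helper: seconds between job creation and its scheduled run,
# rounded to two decimals to absorb sub-millisecond differences.
def deferral_delay(job)
  (job['scheduled_at'] - job['created_at']).round(2)
end

puts "job #{job['jid']} deferred by #{deferral_delay(job)}s"
# prints "job d51303502cb5f6849488961b deferred by 60.0s"
```

The result matches the 1.minute delay passed to defer_on_database_health_signal in step 3.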

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #404898 (closed)

Edited by Prabakaran Murugesan