Throttle Sidekiq jobs based on DB usage
What does this MR do and why?
This merge request adds a new throttling system for background job workers to prevent database overload. The system monitors how much database time workers are using and automatically reduces their concurrency (number of jobs running simultaneously) when they exceed limits.
Key additions include:
- A new test worker, Chaos::DbSleepWorker, that sleeps in the database for testing purposes
- A throttling middleware that sits between job requests and execution
- Logic to decide between "soft throttling" (20% reduction) and "hard throttling" (50% reduction) based on database activity
- A tracking system to remember which workers are currently being throttled
- Integration with existing concurrency limiting infrastructure
The system works by checking 2 indicators:
- Indicator 1: whether a worker is using too much database time (https://gitlab.com/gitlab-org/gitlab/blob/mg%2Fsidekiq-throttling-middleware/lib/gitlab/sidekiq_limits.rb#L8-8)
- Indicator 2: whether database connection statistics show that the worker is dominating database usage (sampled from https://gitlab.com/gitlab-org/gitlab/blob/mg%2Fsidekiq-throttling-middleware/lib/gitlab/database/stat_activity_sampler.rb#L17-32)
Throttling conditions:
- If both indicators are violated, hard throttling is applied.
- If only indicator 1 is violated, soft throttling is applied.
- Otherwise, no throttling is applied. In particular, violating only indicator 2 does not trigger throttling.
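The throttling conditions above can be sketched in a few lines of Ruby. This is only an illustration, not the code in this MR: the module name, method names, and keyword arguments are hypothetical, and rounding down with a floor of 1 is an assumption. Only the decision rules and the 20% / 50% reductions come from the description.

```ruby
# Illustrative sketch only; all names here are hypothetical and are not
# taken from the MR's implementation.
module ThrottleDecisionSketch
  SOFT_REDUCTION_PERCENT = 20 # "soft throttling" per the description
  HARD_REDUCTION_PERCENT = 50 # "hard throttling" per the description

  # db_duration_exceeded: indicator 1 (worker uses too much database time)
  # dominates_db_usage:   indicator 2 (worker dominates sampled DB activity)
  def self.strategy(db_duration_exceeded:, dominates_db_usage:)
    # Indicator 2 alone never triggers throttling.
    return :none unless db_duration_exceeded

    dominates_db_usage ? :hard : :soft
  end

  # Assumption: the reduced limit is rounded down and never drops below 1.
  def self.throttled_limit(current_limit, strategy)
    percent = case strategy
              when :hard then HARD_REDUCTION_PERCENT
              when :soft then SOFT_REDUCTION_PERCENT
              else 0
              end

    [current_limit * (100 - percent) / 100, 1].max
  end
end
```

For example, with a hypothetical concurrency limit of 100, soft throttling in this sketch yields 80 and hard throttling yields 50.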
This helps prevent any single type of background job from overwhelming the database and affecting overall system performance.
References
First task of gitlab-com/gl-infra/observability/team#3815 (closed)
Broken down from !190565 (closed)
How to set up and validate locally
- Apply the following diff. It lowers the DB duration threshold so that throttling is easy to test, and enables Marginalia query comments outside production:

    diff --git a/config/initializers/0_marginalia.rb b/config/initializers/0_marginalia.rb
    index af76e9f048d5..49004db6f4d0 100644
    --- a/config/initializers/0_marginalia.rb
    +++ b/config/initializers/0_marginalia.rb
    @@ -12,7 +12,7 @@
     # We only enable this in production because a number of tests do string
     # matching against the raw SQL, and prepending the comment prevents color
     # coding from working in the development log.
    -Marginalia::Comment.prepend_comment = true if Rails.env.production?
    +Marginalia::Comment.prepend_comment = true
     Marginalia::Comment.components = [:application, :correlation_id, :jid, :endpoint_id, :db_config_database, :db_config_name, :console_hostname, :console_username]
    diff --git a/lib/gitlab/sidekiq_limits.rb b/lib/gitlab/sidekiq_limits.rb
    index ec0ef816f252..f574db020074 100644
    --- a/lib/gitlab/sidekiq_limits.rb
    +++ b/lib/gitlab/sidekiq_limits.rb
    @@ -3,7 +3,7 @@
     module Gitlab
       module SidekiqLimits
         HIGH_URGENCY_DB_DURATION_THRESHOLD_SECONDS = 100_000
    -    DEFAULT_DB_DURATION_THRESHOLD_SECONDS = 20_000
    +    DEFAULT_DB_DURATION_THRESHOLD_SECONDS = 5
         DEFAULT_SIDEKIQ_LIMITS = {
           main_db_duration_limit_per_worker: {

- Restart sidekiq.
- Enable the feature flag:

    Feature.enable(:sidekiq_throttling_middleware)

- While tailing the sidekiq logs in another terminal, schedule a Chaos::DbSleepWorker job:

    Chaos::DbSleepWorker.perform_async(5)

- Wait for the job to finish by looking for "job_status": "done" in the sidekiq log.
- Schedule another job:

    Chaos::DbSleepWorker.perform_async(1)

- You should immediately see in the log that the limit is throttled.
- Schedule a job again within the same minute. There shouldn't be any throttling this time.
- Repeat in the next minute, and the limit is throttled further.
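The behaviour expected in the last steps (at most one throttle per worker per minute, with a further reduction in the following minute) could be modelled roughly as below. This is a hedged sketch, not the tracking system added by this MR: the class name, the in-memory Hash, the injected clock, the rolling 60-second window, and the compounding of successive reductions are all assumptions made for illustration.

```ruby
# Illustrative sketch only; the class, its storage, and the 60-second
# rolling window are assumptions, not the MR's implementation.
class ThrottleTrackerSketch
  WINDOW_SECONDS = 60

  # clock is injectable so the behaviour can be exercised without waiting.
  def initialize(clock: -> { Time.now })
    @clock = clock
    @last_throttled_at = {}
  end

  # Returns the reduced limit, or the unchanged limit when the worker was
  # already throttled within the current window.
  def throttle(worker_name, current_limit, reduction_percent)
    now = @clock.call
    last = @last_throttled_at[worker_name]
    return current_limit if last && (now - last) < WINDOW_SECONDS

    @last_throttled_at[worker_name] = now
    # Assumption: round down, never below 1.
    [current_limit * (100 - reduction_percent) / 100, 1].max
  end
end

# Usage with a fake clock and a hypothetical starting limit of 100:
now = Time.at(0)
tracker = ThrottleTrackerSketch.new(clock: -> { now })

limit = tracker.throttle('Chaos::DbSleepWorker', 100, 20)   # throttled
limit = tracker.throttle('Chaos::DbSleepWorker', limit, 20) # same minute, unchanged
now += 60
limit = tracker.throttle('Chaos::DbSleepWorker', limit, 20) # next minute, throttled further
```

Injecting the clock keeps the sketch deterministic; the real system would need shared state across Sidekiq processes, which this in-memory Hash does not provide.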
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.