Draft: [PoC] Sidekiq throttling based on DB usage
## What does this MR do and why?
PoC for throttling Sidekiq workers based on database usage.

- Set the max concurrency limit for all workers to 3000.
- If a worker has a defined `concurrency_limit` attribute, that value is used as the max concurrency limit.
  - This is the starting concurrency limit, subject to throttling.
- Track the dynamic concurrency limit per worker in Redis.
- Sidekiq middleware that does the following:
  - Throttles the worker based on:
    - Client-side DB duration exceeding its quota (static limit) only → Soft Throttle.
    - Client-side DB duration exceeding its quota AND DB-side active connections (from `pg_stat_activity`) dominating other workers → Hard Throttle.
    - TODO: EWMA implementation (gitlab-com/gl-infra/data-access/durability/team#145); left out for now to avoid adding more complexity.
  - Tracks that the worker has been throttled in the current minute (akin to the `ApplicationRateLimiter` bucketing logic).
    - This ensures we don't throttle the same worker multiple times within the current minute.
- A cronjob running every minute that recovers the current limit if there was no throttling event in the previous minute.
  - This means that if a worker was last throttled at minute `01`, the soonest recovery could happen is at minute `03` (t + 2 mins), and every minute thereafter.
- New Sidekiq admin page for SRE knobs (!190565 (comment 2554332920)):
- Tune current limit per worker
- Disable/enable concurrency limit
- Filter by worker name
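The two-signal escalation described above (duration quota alone → soft, duration quota plus connection dominance → hard) can be sketched as follows. This is an illustration only; the class name and the signal predicates (`duration_exceeded`, `dominates_connections`) are hypothetical placeholders, not the actual middleware API.

```ruby
# Sketch of the middleware's throttling decision, assuming two
# hypothetical boolean signals (not the actual GitLab implementation).
class ThrottlingDecision
  NONE = :none
  SOFT = :soft_throttle
  HARD = :hard_throttle

  # duration_exceeded: client-side DB duration is over its static quota.
  # dominates_connections: worker dominates pg_stat_activity connections.
  def self.decide(duration_exceeded:, dominates_connections:)
    return NONE unless duration_exceeded

    # Both signals firing escalates to a hard throttle; duration alone
    # only triggers the milder soft throttle.
    dominates_connections ? HARD : SOFT
  end
end
```

Note that connection dominance alone never throttles in this sketch: the client-side duration quota acts as the gating signal, and `pg_stat_activity` dominance only escalates severity.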
## Glossary

- Throttle: lower the current concurrency limit in Redis. `SidekiqMiddleware::ConcurrencyLimit::Middleware` is responsible for actually throttling/deferring the job into a queue.
- Soft Throttle: `current_limit * 0.8`
- Hard Throttle: `current_limit * 0.5`
- Gradual Recovery: `max(current_limit + 1, current_limit * 1.1)`

(The factors above are to be discussed.)
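The glossary factors reduce to plain arithmetic. The sketch below applies them; rounding down on throttle, rounding up on recovery, and capping recovery at the max limit are assumptions inferred from the demo logs (3000 → 1500 on hard throttle, 2928 → 3000 on the final recovery step), not confirmed implementation details.

```ruby
# Sketch of the limit-adjustment arithmetic from the glossary.
# Rounding and the max_limit cap are assumptions, not confirmed details.
module ConcurrencyLimits
  SOFT_FACTOR = 0.8
  HARD_FACTOR = 0.5

  def self.soft_throttle(current_limit)
    (current_limit * SOFT_FACTOR).floor
  end

  def self.hard_throttle(current_limit)
    (current_limit * HARD_FACTOR).floor
  end

  # Gradual recovery: max(current_limit + 1, current_limit * 1.1),
  # never exceeding the worker's max limit. The `+ 1` floor guarantees
  # progress even at very small limits, where a 10% step rounds to zero.
  def self.recover(current_limit, max_limit)
    [[current_limit + 1, (current_limit * 1.1).ceil].max, max_limit].min
  end
end
```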
## Local Demo

1. Prepare two terminals: one to schedule the worker, and one to check the logs with:

   ```shell
   gdk tail rails-background-jobs | rg Chaos::DbSleepWorker
   ```

2. Schedule the new `Chaos::DbSleepWorker` in a Rails console:

   ```ruby
   Chaos::DbSleepWorker.perform_async(5)
   ```

3. Once the job execution is actually done (check the Sidekiq logs), schedule the same worker again (repeat the previous step).

4. The worker is now throttled:

   ```
   2025-06-04_16:28:37.48706 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T16:28:37.486Z","class":"Chaos::DbSleepWorker","throttling_decision":"HardThrottle","message":"Chaos::DbSleepWorker is being throttled with strategy HardThrottle","retry":0}
   2025-06-04_16:28:37.48865 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T16:28:37.488Z","message":"Throttled Chaos::DbSleepWorker to 1500"}
   ```

5. Schedule the worker again, making sure it's within the same minute:

   ```
   2025-06-04_16:28:50.47870 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T16:28:50.478Z","message":"already_throttled Chaos::DbSleepWorker true"}
   ```

   This simulates that once the worker is already throttled in the current minute, it won't be throttled again.

6. Let it run for a while; the `RecoveryWorker` cronjob will start recovering the limit every minute until it's back at the max limit:

   ```
   2025-06-04_17:01:05.45256 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T17:01:05.452Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":1500,"new_limit":1651,"max_limit":3000,"retry":0}
   2025-06-04_17:02:04.80048 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T17:02:04.799Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":1651,"new_limit":1817,"max_limit":3000,"retry":0}
   2025-06-04_17:03:07.64308 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T17:03:07.642Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":1817,"new_limit":1999,"max_limit":3000,"retry":0}
   2025-06-04_17:04:04.97366 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T17:04:04.973Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":1999,"new_limit":2199,"max_limit":3000,"retry":0}
   2025-06-04_17:05:05.41103 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T17:05:05.410Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":2199,"new_limit":2419,"max_limit":3000,"retry":0}
   2025-06-04_17:06:03.76573 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T17:06:03.765Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":2419,"new_limit":2661,"max_limit":3000,"retry":0}
   2025-06-04_17:18:07.59620 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T17:18:07.596Z","message":"recovery_worker Chaos::DbSleepWorker current_limit 2661 max_limit 3000"}
   2025-06-04_17:18:07.60128 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T17:18:07.600Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":2661,"new_limit":2928,"max_limit":3000,"retry":0}
   2025-06-04_17:19:03.25203 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T17:19:03.251Z","message":"recovery_worker Chaos::DbSleepWorker current_limit 2928 max_limit 3000"}
   2025-06-04_17:19:03.25463 rails-background-jobs : {"severity":"INFO","time":"2025-06-04T17:19:03.254Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":2928,"new_limit":3000,"max_limit":3000,"retry":0}
   ```
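The `already_throttled` behavior seen in step 5 can be illustrated as a per-minute bucket, akin to the `ApplicationRateLimiter` bucketing the description mentions. The sketch below uses an in-memory hash in place of Redis, and the key format is hypothetical; the real implementation would presumably use a Redis key with a short TTL.

```ruby
# Illustration of per-minute throttle bucketing: a worker is marked as
# throttled at most once per wall-clock minute. The hash stands in for
# Redis (e.g. SET with a ~1 minute expiry); the key format is made up.
class ThrottleBucket
  def initialize
    @store = {}
  end

  def key_for(worker_name, now)
    "throttled:#{worker_name}:#{now.strftime('%Y-%m-%dT%H:%M')}"
  end

  # Returns true only the first time a worker is throttled in a given
  # minute; repeat calls within the same minute return false (no-op).
  def mark_throttled(worker_name, now = Time.now)
    key = key_for(worker_name, now)
    return false if @store[key] # already throttled this minute

    @store[key] = true
  end
end
```

Because the bucket key rolls over at each minute boundary, a burst of slow jobs produces at most one limit reduction per minute, which is what keeps the halving/0.8x factors from compounding within a single minute.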
## References

- gitlab-com/gl-infra/observability/team#3815 (closed)
## Screenshots or screen recordings

| Before | After |
| --- | --- |
## How to set up and validate locally

## MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.