
draft: [PoC] Sidekiq throttling based on DB usage

What does this MR do and why?

PoC for throttling Sidekiq workers based on database usage.

  • Set the max concurrency limit for all workers to 3000.
    • If a worker has a defined concurrency_limit attribute, that value is used as its max concurrency limit instead.
    • This max limit is the starting concurrency limit that throttling adjusts.
  • Track the dynamic concurrency limit per worker in Redis.
  • Sidekiq middleware that does the following (see the sketch after this list):
    • Throttle the worker based on:
      • client-side DB duration exceeding the quota (static limit) ONLY --> Soft Throttle
      • client-side DB duration exceeding the quota AND DB-side active connections (from pg_stat_activity) dominating other workers --> Hard Throttle
      • TODO: EWMA implementation gitlab-com/gl-infra/data-access/durability/team#145 (left out for now to avoid adding more complexity)
    • Track that the worker has been throttled in the current minute (akin to ApplicationRateLimiter bucketing logic).
      • This ensures we don't throttle the same worker multiple times within the same minute.
  • Background cron job running every minute that recovers the current limit if there was no throttling event in the previous minute.
    • This means that if a worker was last throttled at minute 01, the earliest recovery can happen is at minute 03 (t + 2 minutes), and every minute thereafter.
  • New Sidekiq admin page for SRE knobs !190565 (comment 2554332920):
    • Tune the current limit per worker
    • Disable/enable concurrency limit
    • Filter by worker name
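The sketch below illustrates the decision flow described above: check the client-side DB duration against the quota, escalate to a hard throttle when pg_stat_activity dominance is also detected, lower the Redis-tracked limit by the corresponding factor, and record a minute bucket so the worker isn't throttled twice in the same minute. Only the SidekiqMiddleware::ConcurrencyLimit namespace, the 3000 default, the 0.8/0.5 factors, and the minute-bucket idea come from this MR; the class name, Redis keys, and predicate helpers are hypothetical placeholders, not the actual implementation.

```ruby
# Illustrative sketch only; the class name, Redis keys, and predicate helpers
# are hypothetical placeholders.
require 'redis'

module SidekiqMiddleware
  module ConcurrencyLimit
    class DbUsageThrottler
      DEFAULT_MAX_LIMIT = 3000 # default max concurrency limit for all workers
      SOFT_FACTOR = 0.8        # Soft Throttle: current_limit * 0.8
      HARD_FACTOR = 0.5        # Hard Throttle: current_limit * 0.5

      def initialize(worker_name, redis: Redis.new)
        @worker_name = worker_name
        @redis = redis
      end

      def call
        return if already_throttled_this_minute?

        decision = decide
        return if decision == :none

        factor = (decision == :hard ? HARD_FACTOR : SOFT_FACTOR)
        @redis.set(limit_key, (current_limit * factor).to_i)
        # Minute bucket, akin to ApplicationRateLimiter; kept ~2 minutes so the
        # recovery job can still inspect the previous minute.
        @redis.set(throttled_key, 1, ex: 120)
      end

      private

      def decide
        return :none unless db_duration_exceeds_quota? # client-side DB duration vs static quota
        return :hard if dominates_active_connections?  # pg_stat_activity dominance over other workers

        :soft
      end

      # Placeholder predicates: the real signals would come from client-side DB
      # duration metrics and pg_stat_activity sampling.
      def db_duration_exceeds_quota?
        false
      end

      def dominates_active_connections?
        false
      end

      def already_throttled_this_minute?
        @redis.exists?(throttled_key)
      end

      def current_limit
        (@redis.get(limit_key) || DEFAULT_MAX_LIMIT).to_i
      end

      def limit_key
        "sidekiq:concurrency_limit:#{@worker_name}:current_limit"
      end

      def throttled_key
        "sidekiq:concurrency_limit:#{@worker_name}:throttled:#{Time.now.utc.strftime('%Y%m%d%H%M')}"
      end
    end
  end
end
```

The existing SidekiqMiddleware::ConcurrencyLimit::Middleware then enforces whatever limit is stored in Redis by deferring excess jobs (see the glossary below).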

Glossary

  • Throttle - Lower the current concurrency limit in Redis. SidekiqMiddleware::ConcurrencyLimit::Middleware is responsible for actually throttling/deferring jobs into a queue.
  • Soft Throttle - current_limit * 0.8
  • Hard Throttle - current_limit * 0.5
  • Gradual Recovery - max(current_limit + 1, current_limit * 1.1)
  • (Factors above to be discussed; see the worked example below.)
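Concretely, the factors work out roughly as follows. This is a sketch only: the rounding behaviour is an assumption and may not match the actual implementation, and recovery is capped at the max limit, as the demo logs below show.

```ruby
def soft_throttle(current_limit)
  (current_limit * 0.8).to_i
end

def hard_throttle(current_limit)
  (current_limit * 0.5).to_i
end

def gradual_recovery(current_limit, max_limit)
  [[current_limit + 1, (current_limit * 1.1).ceil].max, max_limit].min
end

hard_throttle(3000)           # => 1500, matching the first throttle in the demo logs below
gradual_recovery(2661, 3000)  # => 2928, matching one of the recovery steps in the demo
gradual_recovery(2928, 3000)  # => 3000, capped at the max limit
```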

Local Demo

  1. Prepare 2 terminals: 1 to schedule the worker, and 1 to check logs with gdk tail rails-background-jobs | rg Chaos::DbSleepWorker

  2. Schedule the new DbSleepWorker in a Rails console:

    Chaos::DbSleepWorker.perform_async(5)
  3. Once the job execution is actually done (check the Sidekiq logs), schedule the same worker again (repeat step 2).

  4. The worker is now throttled:

    2025-06-04_16:28:37.48706 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T16:28:37.486Z","class":"Chaos::DbSleepWorker","throttling_decision":"HardThrottle","message":"Chaos::DbSleepWorker is being throttled with strategy HardThrottle","retry":0}
    2025-06-04_16:28:37.48865 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T16:28:37.488Z","message":"Throttled Chaos::DbSleepWorker to 1500"}
  5. Schedule the worker again, making sure it's within the same minute:

    2025-06-04_16:28:50.47870 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T16:28:50.478Z","message":"already_throttled Chaos::DbSleepWorker true"}

    This demonstrates that once the worker has already been throttled in the current minute, it won't be throttled again.

  6. Leave it for a while; the RecoveryWorker cron job will start recovering the limit every minute until it's back to the max limit (a sketch of this recovery job follows these steps).

    2025-06-04_17:01:05.45256 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T17:01:05.452Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":1500,"new_limit":1651,"max_limit":3000,"retry":0}
    2025-06-04_17:02:04.80048 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T17:02:04.799Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":1651,"new_limit":1817,"max_limit":3000,"retry":0}
    2025-06-04_17:03:07.64308 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T17:03:07.642Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":1817,"new_limit":1999,"max_limit":3000,"retry":0}
    2025-06-04_17:04:04.97366 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T17:04:04.973Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":1999,"new_limit":2199,"max_limit":3000,"retry":0}
    2025-06-04_17:05:05.41103 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T17:05:05.410Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":2199,"new_limit":2419,"max_limit":3000,"retry":0}
    2025-06-04_17:06:03.76573 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T17:06:03.765Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":2419,"new_limit":2661,"max_limit":3000,"retry":0}
    2025-06-04_17:18:07.59620 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T17:18:07.596Z","message":"recovery_worker Chaos::DbSleepWorker current_limit 2661 max_limit 3000"}
    2025-06-04_17:18:07.60128 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T17:18:07.600Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":2661,"new_limit":2928,"max_limit":3000,"retry":0}
    2025-06-04_17:19:03.25203 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T17:19:03.251Z","message":"recovery_worker Chaos::DbSleepWorker current_limit 2928 max_limit 3000"}
    2025-06-04_17:19:03.25463 rails-background-jobs   : {"severity":"INFO","time":"2025-06-04T17:19:03.254Z","message":"Recovering concurrency limit for worker","worker_name":"Chaos::DbSleepWorker","previous_limit":2928,"new_limit":3000,"max_limit":3000,"retry":0}

    As noted above, the earliest recovery can happen is two minutes after the last throttling event, and it then repeats every minute until the limit is back at the max.
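For completeness, here is a minimal sketch of what the recovery cron job could look like, under the same assumptions as the middleware sketch above. The class name, Redis keys, and helpers are hypothetical; only the every-minute cadence, the skip-if-throttled-in-the-previous-minute rule, and the gradual-recovery formula come from this MR.

```ruby
# Illustrative sketch only; names, keys, and helpers are hypothetical.
require 'redis'
require 'sidekiq'

class ConcurrencyLimitRecoveryWorker
  MAX_LIMIT = 3000

  def perform
    redis = Redis.new

    tracked_workers.each do |worker_name|
      # Skip recovery if the worker was throttled in the previous minute.
      next if throttled_in_previous_minute?(redis, worker_name)

      current_limit = (redis.get(limit_key(worker_name)) || MAX_LIMIT).to_i
      next if current_limit >= MAX_LIMIT

      # Gradual recovery: grow by at least 1 and roughly 10% per minute, capped at the max limit.
      new_limit = [[current_limit + 1, (current_limit * 1.1).ceil].max, MAX_LIMIT].min
      redis.set(limit_key(worker_name), new_limit)

      Sidekiq.logger.info("Recovering #{worker_name}: #{current_limit} -> #{new_limit} (max #{MAX_LIMIT})")
    end
  end

  private

  # Placeholder: the set of workers with a dynamic limit tracked in Redis.
  def tracked_workers
    []
  end

  def throttled_in_previous_minute?(redis, worker_name)
    previous_minute = (Time.now.utc - 60).strftime('%Y%m%d%H%M')
    redis.exists?("sidekiq:concurrency_limit:#{worker_name}:throttled:#{previous_minute}")
  end

  def limit_key(worker_name)
    "sidekiq:concurrency_limit:#{worker_name}:current_limit"
  end
end
```

Because the cron job only looks at the previous minute's bucket, a worker throttled at minute 01 is skipped at minute 02 and first recovered at minute 03, matching the timing described above.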

References

gitlab-com/gl-infra/observability/team#3815 (closed)

Screenshots or screen recordings

Before After

How to set up and validate locally

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
