# Defer Sidekiq jobs via feature flags
## Context
As proposed in option 2 in gitlab-org/gitlab#408520 (comment 1372009316), we want to use feature flags to control whether jobs from certain Sidekiq workers should be deferred, e.g. in an incident where runaway workers are saturating DB resources.

The benefit of using feature flags is that the changes take effect quickly, without restarting Sidekiq pods. The existing multiple layers of caching in the feature flag implementation also come in handy.
## Proposal
### 1. Using a worker actor
We could extend our current feature flags with a new actor type, `worker`. An operator could then simply run a ChatOps command like:

```shell
/chatops run feature set --worker=SlowRunningWorker defer_sidekiq_job true
```
Once the worker is deemed safe to run normally, one could turn the feature flag off:

```shell
/chatops run feature set --worker=SlowRunningWorker defer_sidekiq_job false
```
Usage of the feature flag looks something like this:

```ruby
Feature.enabled?(:defer_sidekiq_jobs, Gitlab::SidekiqDeferredWorker.new(worker_klass), type: :ops)
Tradeoffs:

- (+) Explicit feature flag, defined as `defer_sidekiq_jobs`.
- (-) Couldn't support a rate-limiting/throttling-like mechanism out of the box; `percentage of time` works as a separate gate, not within an actor (more in #2336 (comment 1383287150)).
  - An alternative could be implemented by passing the `percentage of time` value as the actor value itself (#2336 (comment 1385582509)). However, this approach would likely confuse the operators managing the feature flags.
### 2. Using a worker type
This approach allows us to dynamically generate the feature flag with the worker name.
A cycle of fully deferring the jobs, slowly releasing them using `percentage of time`, and finally fully releasing them would look something like:

```shell
# defer 100% of the jobs
/chatops run feature set defer_sidekiq_job:SlowRunningWorker true

# defer 99% of the jobs, only letting 1% be processed
/chatops run feature set defer_sidekiq_job:SlowRunningWorker 99

# defer 50% of the jobs
/chatops run feature set defer_sidekiq_job:SlowRunningWorker 50

# stop deferring the jobs, back to normal
/chatops run feature set defer_sidekiq_job:SlowRunningWorker false
```
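Flipper's `percentage of time` gate is random per call: setting the flag to 99 makes the check pass on roughly 99% of calls, so roughly 99% of jobs get deferred. A quick simulation of that semantics in plain Ruby (this is an illustration of the behavior, not the Flipper implementation):

```ruby
# Simulate the percentage-of-time gate: each check passes independently
# with probability percentage / 100.0 -- a coin flip per job.
percentage = 99
samples = 100_000
deferred = samples.times.count { rand * 100 < percentage }
ratio = deferred.fdiv(samples) # ~0.99, so ~1% of jobs are still processed
```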
The feature flag usage in Rails would look like:

```ruby
Feature.enabled?(:"defer_sidekiq_jobs:AuthorizedProjectsWorker", type: :worker, default_enabled_if_undefined: false)
```
Tradeoffs:

- (+) `percentage of time` is supported out of the box, just like regular feature flag usage.
- (+) Simpler to implement.
- (-) Lack of hygiene and trackability, as the feature definitions are not committed as `yml` files like other `development` or `ops` flags.
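Either proposal would consume the flag from a Sidekiq server middleware (see the Tasks below). A runnable sketch using proposal 2's dynamic flag name — the class name, the delay value, and the `Feature` stand-in are all assumptions, made so the snippet runs outside GitLab:

```ruby
# Minimal Feature stand-in so the sketch is self-contained; the real
# GitLab Feature module has a different implementation.
module Feature
  def self.flags
    @flags ||= {}
  end

  def self.enabled?(name, type: nil, default_enabled_if_undefined: false)
    flags.fetch(name, default_enabled_if_undefined)
  end
end

# Hypothetical Sidekiq server middleware: when the flag for this worker is
# on, re-schedule the job for later instead of running it now.
class DeferJobsMiddleware
  DELAY = 5 * 60 # seconds; see the "How long should jobs be deferred?" discussion

  def call(worker, job, _queue)
    flag = :"defer_sidekiq_jobs:#{worker.class.name}"

    if Feature.enabled?(flag, type: :worker, default_enabled_if_undefined: false)
      worker.class.perform_in(DELAY, *job['args']) # push a delayed copy
      return # skip processing; the original job is acknowledged
    end

    yield
  end
end
```

Returning without yielding acknowledges the original job, while the re-scheduled copy lands back on the queue after `DELAY`, where the flag is checked again — so jobs keep bouncing until the flag is turned off.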
## Tasks
Proposal 1:

- [ ] Extend the feature flags API to support the `worker` actor type
- [ ] Support the `worker` actor type in ChatOps
- [ ] Implement deferring Sidekiq jobs as a Sidekiq server middleware (inspiration)
- [ ] ~~(Optional) Experiment if it's possible to combine `percentage of time` (a Flipper feature) with the `worker` actor type as a way to implement rate limiting based on randomness.~~ Not possible, as described in #2336 (comment 1383287150).
Proposal 2:

- [ ] Add the `worker` type in the feature flag definitions
- [ ] Implement deferring Sidekiq jobs as a Sidekiq server middleware (inspiration)
Other tasks regardless of implementation:

- [ ] Test the behavior in a non-prod environment
- [ ] Document the usage and examples in runbooks
- [ ] Announce the feature in Slack channels (#infrastructure-lounge)
## Discussions
### How long should jobs be deferred?
If the interval is too short, Sidekiq workers will end up processing too many deferred jobs, which will also end up being deferred again. This could affect the performance of other workers in the same shard.
If the interval is too long, jobs might remain deferred for too long after the incident is resolved and the feature flag is turned off.
We could decide this based on various factors:
- Average time to resolution of major incidents
- Oncall duration (?) (I think it shouldn't be longer than a single SRE oncall duration)
Some values in mind: 5m, 30m, 1h
### What if someone forgets to turn off the feature flag?
We have alerts for Sidekiq jobs that are being enqueued without being dequeued (https://gitlab.com/gitlab-com/runbooks/blob/aefea8f01e4612e846627e27b70aa4933b2f6247/thanos-rules/autogenerated-sidekiq-alerts-gprd.yml#L259), but they only apply to `urgency=throttled` workers. We might want a similar alerting mechanism specific to this use case.