# Defer Sidekiq jobs via feature flags
## Context
As proposed in option 2 in gitlab-org/gitlab#408520 (comment 1372009316), we want to use feature flags to control whether jobs from certain Sidekiq workers should be deferred, e.g. in an incident where runaway workers are saturating DB resources.

The benefit of using feature flags is that the changes take effect quickly, without restarting Sidekiq pods. The existing multiple layers of caching in the feature flag implementation also come in handy.
## Proposal
### 1. Using a worker actor
We could extend our current feature flags with a new actor type, `worker`. An operator could then simply run a ChatOps command like:

```shell
/chatops run feature set --worker=SlowRunningWorker defer_sidekiq_job true
```
Once the worker is deemed safe to run normally, one could turn the feature flag off:

```shell
/chatops run feature set --worker=SlowRunningWorker defer_sidekiq_job false
```
Usage of the feature flag looks something like this:

```ruby
Feature.enabled?(:defer_sidekiq_jobs, Gitlab::SidekiqDeferredWorker.new(worker_klass), type: :ops)
Tradeoffs:

- (+) Explicit feature flag, defined as `defer_sidekiq_jobs`.
- (-) Couldn't support a rate-limiting/throttling-like mechanism out of the box; `percentage of time` works as a separate gate, not within an actor (more in #2336 (comment 1383287150)).
  - An alternative could be implemented by passing the `percentage of time` value as the actor value itself (#2336 (comment 1385582509)). However, this approach would likely confuse the operators managing the feature flags.
### 2. Using a worker type
This approach allows us to dynamically generate the feature flag with the worker name.
A cycle of fully deferring the jobs, slowly releasing them using `percentage of time`, and finally fully releasing them would look something like:

```shell
# defer 100% of the jobs
/chatops run feature set defer_sidekiq_job:SlowRunningWorker true

# defer 99% of the jobs, only letting 1% be processed
/chatops run feature set defer_sidekiq_job:SlowRunningWorker 99

# defer 50% of the jobs
/chatops run feature set defer_sidekiq_job:SlowRunningWorker 50

# stop deferring the jobs, back to normal
/chatops run feature set defer_sidekiq_job:SlowRunningWorker false
```
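Flipper's `percentage of time` gate is random per call: setting the flag to 99 makes the check pass on roughly 99% of calls, so roughly 99% of jobs get deferred. A quick simulation of that semantics in plain Ruby (this is an illustration of the behavior, not the Flipper implementation):

```ruby
# Simulate the percentage-of-time gate: each check passes independently
# with probability percentage / 100.0 -- a coin flip per job.
percentage = 99
samples = 100_000
deferred = samples.times.count { rand * 100 < percentage }
ratio = deferred.fdiv(samples) # ~0.99, so ~1% of jobs are still processed
```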
The feature flag usage in Rails would look like:

```ruby
Feature.enabled?(:"defer_sidekiq_jobs:AuthorizedProjectsWorker", type: :worker, default_enabled_if_undefined: false)
```
Tradeoffs:

- (+) `percentage of time` is supported out of the box, just like regular feature flag usage.
- (+) Simpler to implement.
- (-) Lack of hygiene and trackability, as the feature definitions are not committed as `yml` files like other `development` or `ops` flags.
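Either proposal would consume the flag from a Sidekiq server middleware (see the Tasks below). A runnable sketch using proposal 2's dynamic flag name — the class name, the delay value, and the `Feature` stand-in are all assumptions, made so the snippet runs outside GitLab:

```ruby
# Minimal Feature stand-in so the sketch is self-contained; the real
# GitLab Feature module has a different implementation.
module Feature
  def self.flags
    @flags ||= {}
  end

  def self.enabled?(name, type: nil, default_enabled_if_undefined: false)
    flags.fetch(name, default_enabled_if_undefined)
  end
end

# Hypothetical Sidekiq server middleware: when the flag for this worker is
# on, re-schedule the job for later instead of running it now.
class DeferJobsMiddleware
  DELAY = 5 * 60 # seconds; see the "How long should jobs be deferred?" discussion

  def call(worker, job, _queue)
    flag = :"defer_sidekiq_jobs:#{worker.class.name}"

    if Feature.enabled?(flag, type: :worker, default_enabled_if_undefined: false)
      worker.class.perform_in(DELAY, *job['args']) # push a delayed copy
      return # skip processing; the original job is acknowledged
    end

    yield
  end
end
```

Returning without yielding acknowledges the original job, while the re-scheduled copy lands back on the queue after `DELAY`, where the flag is checked again — so jobs keep bouncing until the flag is turned off.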
## Tasks
Proposal 1:

- [ ] Extend the feature flags API to support the `worker` actor type
- [ ] Support the `worker` actor type in ChatOps
- [ ] Implement deferring Sidekiq jobs as a Sidekiq server middleware (inspiration)
- [ ] ~~(Optional) Experiment if it's possible to combine `percentage of time` (a Flipper feature) with the `worker` actor type as a way to implement rate limiting based on randomness.~~ Not possible, as described in #2336 (comment 1383287150).
Proposal 2:

- [ ] Add the `worker` type in the feature flag definitions
- [ ] Implement deferring Sidekiq jobs as a Sidekiq server middleware (inspiration)
Other tasks regardless of implementation:

- [ ] Test the behavior in a non-prod environment
- [ ] Document the usage and examples in runbooks
- [ ] Announce the feature in Slack channels (#infrastructure-lounge)
## Discussions
### How long should jobs be deferred?
If the interval is too short, Sidekiq workers will end up processing too many deferred jobs, which will also end up being deferred again. This could affect the performance of other workers in the same shard.
If the interval is too long, jobs might remain deferred for too long after the incident is resolved and the feature flag is turned off.
We could decide this based on various factors:
- Average time to resolution of major incidents
- Oncall duration (?) (I think it shouldn't be longer than a single SRE oncall duration)
Some values in mind: 5m, 30m, 1h
### What if someone forgets to turn off the feature flag?
We have alerts for Sidekiq jobs that are being enqueued without being dequeued (https://gitlab.com/gitlab-com/runbooks/blob/aefea8f01e4612e846627e27b70aa4933b2f6247/thanos-rules/autogenerated-sidekiq-alerts-gprd.yml#L259), but they only apply to `urgency=throttled` workers. We might want a similar alerting mechanism specific to this use case.