Add an administrative method to manage Worker instance processing by Sidekiq
Release notes
Add a Worker deny-list to the GitLab web application admin panel that defers processing of a listed Worker's jobs in the Sidekiq queue until the Worker is removed from the deny-list.
Problem to solve
Sometimes, as was the case in this severity-1 incident, despite our best efforts to avoid contentious behavior from Worker instances, an installation's infrastructure resources (such as the database and its connection pool) become saturated by runaway Worker instances.
GitLab currently has no simple way, easily and quickly accessible during a crisis, to prevent further processing of resource-contentious Worker instances.
Note that the problem to solve here is not terminating existing jobs, but preventing certain classes of Workers from picking up enqueued jobs; ideally, those jobs would be re-enqueued into a deferral queue for later processing, or processed at a reduced rate if necessary. This way, jobs found to cause resource saturation can be paused, mitigating degradation as quickly as possible.
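As a rough sketch of this deferral mechanism, a Sidekiq server middleware could intercept jobs whose Worker class is on the deny-list and park them instead of running them. The class name `DeferDeniedWorkers` and the in-memory `DENY_LIST` and `DEFERRED` stores below are hypothetical stand-ins for a shared store (such as Redis or an application setting) that a real implementation would use:

```ruby
require 'set'

# Hypothetical Sidekiq server middleware: jobs for denied Worker classes
# are parked in a deferral store instead of being executed.
class DeferDeniedWorkers
  DENY_LIST = Set.new  # Worker class names currently denied processing
  DEFERRED  = []       # stand-in for a persistent deferral queue

  # Sidekiq server middleware interface: call(worker, job, queue) { ... }
  def call(worker, job, queue)
    if DENY_LIST.include?(job['class'])
      # Park the job rather than run it; a real implementation would push
      # it to a dedicated Sidekiq queue such as "deferred".
      DEFERRED << job
      return
    end
    yield # not denied: process normally
  end
end
```

In a real deployment this would be registered in the Sidekiq server middleware chain, and the deny-list would be read from the admin-panel setting rather than a constant.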
Proposal
Workers to be denied processing would be specified by class name or namespace.
Administrator users would navigate to a panel section and be presented with usage instructions and a text field containing a list of identifiers (ideally this field would offer drop-down suggestions for Worker class identifier names).
Jobs for denied Workers which would otherwise have been enqueued would be placed in a persistent defer queue.
When an update to the deny-list removes a Worker's identifier, that Worker's deferred jobs would be returned to their original target queue and normal processing would resume.
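The release step above could look something like the following sketch: after a deny-list update, select the parked jobs whose class is no longer denied and hand them back for re-enqueuing. The helper name `releasable_jobs` is hypothetical; in production the caller would push each released job back to its original queue (for example via `Sidekiq::Client.push`) rather than just returning it:

```ruby
require 'set'

# Remove jobs whose Worker class is no longer denied from the deferral
# store and return them so the caller can re-enqueue each to its
# original target queue.
def releasable_jobs(deferred, deny_list)
  released, still_denied = deferred.partition do |job|
    !deny_list.include?(job['class'])
  end
  deferred.replace(still_denied) # keep only jobs that remain denied
  released
end
```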
Alternatively, though perhaps far more complex to implement, an explicit per-minute rate-limit could be applied to each Worker, identified by namespace and class name. This would enable an effective deny-list: a Worker assigned a rate-limit of zero per minute ("[0] per minute" in the interface) would be fully denied.
This could possibly be implemented in a manner similar to the Gitaly shard balancing interface, where each shard is assigned a weight. In this case, a rate-limit entry would be created for each targeted Worker; all other Workers would be processed normally.
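A minimal sketch of the rate-limit variant, where a limit of zero behaves as a deny-list entry. The class name `WorkerRateLimiter` is hypothetical, and the in-memory counter hash stands in for what would realistically be a Redis counter keyed by Worker name and minute window (e.g. `INCR` with a 60-second expiry):

```ruby
# Hypothetical per-minute, per-Worker rate limiter; a limit of 0 acts as
# an effective deny-list entry, and unlisted Workers are never limited.
class WorkerRateLimiter
  def initialize(limits) # e.g. { 'SlowWorker' => 2, 'BadWorker' => 0 }
    @limits = limits
    @counts = Hash.new(0) # stand-in for Redis counters per minute window
  end

  # true if the job may run in the current window, false if it should
  # be deferred instead
  def allow?(worker_class)
    limit = @limits.fetch(worker_class, Float::INFINITY)
    return false if @counts[worker_class] >= limit

    @counts[worker_class] += 1
    true
  end
end
```

A middleware would consult `allow?` before yielding to the job, deferring the job when it returns false, and the counters would reset each minute window.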
My understanding is that this feature might be possible to implement at the Sidekiq level using a paid tier of Sidekiq. However, that is not always possible for self-managed installations, and in the case of the GitLab.com SaaS platform, it might not be financially preferable. Building it into GitLab would also dog-food the feature in the product itself, which I think would be an important achievement and a very desirable result.
Intended users
Sidney
Feature Usage Metrics
The actual value would be a mitigation result in a degradation scenario where some Worker instance had caused infrastructure resource saturation or contention, preventing the processing of other jobs in Sidekiq.
Besides incident time-to-mitigation reduction metrics, usage might be tracked through either the number of Worker instances queued for deferral over some interval, or the total number of Workers designated for processing deferral or rate-limiting per GitLab installation.
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.