Infradev: Sidekiq shard/DB saturation Security::ScanResultPolicies::AddApproversToRulesWorker

Summary

During incident gitlab-com/gl-infra/production#17692 (closed), an Ultimate group was shared with gitlab-org. This triggered the creation of 4.2M Security::ScanResultPolicies::AddApproversToRulesWorker jobs, which saturated both the Sidekiq shard and database connections.

More details and logs are in the incident issue.

Impact

The Sidekiq catchall shard was saturated, increasing job queueing and execution latency. Database connections were also saturated, which put further pressure on the shard and queue as jobs waited for DB connections.

Recommendation

Review the worker's batching strategy. From a quick glance, we create one job per project and then batch process 100 users at a time.

Batch processing multiple projects per job could help reduce the number of jobs.
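To give a rough sense of the reduction, here is a minimal sketch of grouping project IDs into per-job batches. The method names and the batch size of 100 are hypothetical and not taken from the worker's actual code; the 4.2M figure assumes one job per project, as the quick glance above suggests.

```ruby
# Hypothetical sketch, not the actual GitLab implementation.
# PROJECTS_PER_JOB is an assumed batch size chosen for illustration only.
PROJECTS_PER_JOB = 100

# Turns a flat list of project IDs into one argument list per job,
# so that each job receives an array of project IDs instead of a single ID.
def job_arguments_for(project_ids, batch_size: PROJECTS_PER_JOB)
  project_ids.each_slice(batch_size).map { |slice| [slice] }
end

# Number of jobs needed for a given number of projects (ceiling division).
def job_count(project_count, batch_size: PROJECTS_PER_JOB)
  (project_count + batch_size - 1) / batch_size
end

puts job_arguments_for((1..250).to_a).size
puts job_count(4_200_000)
```

Under these assumptions, 4.2M per-project jobs would shrink to 42,000 batched jobs. Each job would run longer, so the batch size would need to be tuned against job duration and database load.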

Overall, because these jobs are neither critical nor urgent, we should consider throttling their execution, given the risk of saturation with group-wide policies.
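One possible way to throttle is to spread job scheduling over time rather than enqueueing everything at once. The sketch below only computes the delays; the method name and both parameters are hypothetical, and the limits shown are illustrative values, not a recommendation.

```ruby
# Hypothetical sketch: assign each batch a delay so that at most
# `jobs_per_interval` jobs become runnable in each `interval_seconds` window.
# Neither the name nor the parameters come from the existing worker.
def staggered_schedule(batches, jobs_per_interval:, interval_seconds:)
  batches.each_with_index.map do |batch, index|
    # Integer division places every `jobs_per_interval` batches in the same window.
    delay = (index / jobs_per_interval) * interval_seconds
    [delay, batch]
  end
end

schedule = staggered_schedule(
  [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]],
  jobs_per_interval: 2,
  interval_seconds: 60
)
schedule.each { |delay, batch| puts "in #{delay}s: #{batch.inspect}" }
```

Each `[delay, batch]` pair could then be handed to Sidekiq's `perform_in(delay, batch)` on the worker class, assuming the worker is changed to accept an array of project IDs.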

Verification

Security::ScanResultPolicies::AddApproversToRulesWorker has an upper limit on how many jobs are created or scheduled within a given interval, enforced by batching projects (reducing the number of jobs), throttling job creation, or both.

Edited by Filipe Santos