Sidekiq shard got stuck on Geo::RepositoryVerification::Primary::ShardWorker
Summary
GitLab.com separates our Sidekiq fleet into shards to ensure that jobs run with the appropriate level of urgency and that we have the resources needed to execute them. We wrote a blog post about this: https://about.gitlab.com/blog/2020/06/24/scaling-our-use-of-sidekiq/
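As a rough illustration (not GitLab's actual worker code; the class and queue names below are hypothetical), a job declares the queue it runs on, and each shard's Sidekiq processes listen only to the queues that shard owns (e.g. started with `sidekiq -q geo`):

```ruby
require 'sidekiq'

# Minimal sketch only: class and queue names are placeholders, not the real
# Geo worker. The point is that a job is pinned to a queue, and each shard
# runs processes that consume only its own set of queues.
class ExampleShardedWorker
  include Sidekiq::Job # Sidekiq::Worker on older Sidekiq versions

  sidekiq_options queue: :geo

  def perform(shard_name)
    # do the work for the given repository shard
  end
end
```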
In our staging environment, the catchall fleet filled up with jobs from Geo::RepositoryVerification::Primary::ShardWorker, which got stuck in some sort of loop, with a log message incrementing forever (e.g. Loop 428). Because of this, as jobs were cancelled our catchall shard scaled down (since we currently drive autoscaling based on CPU resource usage), which ended up causing ALL jobs assigned to this shard to appear hung.
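For context, a hedged sketch of how the backlog could be confirmed from a console using the standard Sidekiq API; the queue name below is a placeholder, not necessarily the real queue for this worker:

```ruby
require 'sidekiq/api'

# Placeholder queue name for illustration only.
queue = Sidekiq::Queue.new('geo:repository_verification_primary_shard')
puts "backlog: #{queue.size} jobs, oldest waiting #{queue.latency.round}s"

# Fleet-wide view: a shard that scaled down on low CPU shows few processes
# and busy workers here even while the backlog above keeps growing.
stats = Sidekiq::Stats.new
puts "enqueued: #{stats.enqueued}, busy: #{stats.workers_size}, processes: #{stats.processes_size}"
```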
Steps to reproduce
TBD
What is the current bug behavior?
It appears we filled the queue with a job that is unable to succeed, which forced the entire shard to scale down. Open questions:
- What is the failure scenario we may have run into?
- How long does it take for this job to fail?
- Can we make it fail faster? (See the sketch after this list for one possible approach.)
- Should GitLab.com (SaaS) consider moving this queue to a different shard?
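On the "fail faster" question, a minimal sketch, assuming the worker's internal loop is what emits the Loop N messages: bound the loop with an iteration cap and a time budget so a non-converging run surfaces as a failure instead of spinning forever. The cap, deadline, and helper below are hypothetical, not the actual worker code.

```ruby
require 'sidekiq'

MAX_LOOPS = 100     # hypothetical iteration cap
DEADLINE  = 10 * 60 # hypothetical time budget, in seconds

# Stand-in for one batch of verification work; returns true when finished.
def do_one_verification_batch
  true
end

def run_with_bounds
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  1.upto(MAX_LOOPS) do |i|
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
    raise "verification did not finish within #{DEADLINE}s" if elapsed > DEADLINE

    Sidekiq.logger.info("Loop #{i}")
    break if do_one_verification_batch
  end
end
```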
What is the expected correct behavior?
Ultimately, GitLab.com should scale based on queue length instead of CPU utilization, though we may still run into a situation where we hit the ceiling of our autoscaling capacity and all Pods end up processing this failure scenario.
Let's try to answer the questions above so that we can determine the appropriate resolution.
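On scaling by queue length, one possible direction (a sketch only, using the prometheus-client gem; the metric and queue names are placeholders) is to export per-queue backlog as a gauge that the autoscaler could target instead of CPU:

```ruby
require 'sidekiq/api'
require 'prometheus/client'

# Placeholder metric and queue names for illustration.
registry = Prometheus::Client.registry
backlog  = registry.gauge(:sidekiq_queue_backlog,
                          docstring: 'Enqueued jobs per Sidekiq queue',
                          labels: [:queue])

%w[default geo:repository_verification_primary_shard].each do |name|
  backlog.set(Sidekiq::Queue.new(name).size, labels: { queue: name })
end
```

The scrape and autoscaler wiring is out of scope here, but a queue-depth signal like this is what queue-length-based scaling would key off.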
Relevant logs and/or screenshots
Elasticsearch: https://nonprod-log.gitlab.net/goto/6eec3f3ecfa6671baadc67ae765c5d2c (Internal)
Source (Internal)
Output of checks
This bug happened on staging.gitlab.com