Staging sidekiq best-effort cluster flattened by never-ending background migrations
Since 2020-01-14, the 3 besteffort sidekiq nodes have been pegged at 100% CPU. It appears this is the background_migration queue (BackgroundMigrationWorker), processing (or attempting to process) many many ActivatePrometheusServicesForSharedClusterApplications
jobs.
sidekiq_queue_size{name="background_migration", fqdn=~"redis-sidekiq-01-db-gstg.*"}
from thanos (to get the long view):
This may be causing problems for other besteffort jobs being processed in an effective fashion, and is almost definitely not something that we want to go on forever. It's not yet 100% clear whether there are just a lot of jobs to process and it's just taking a long time, but the 2-week form of that graph doesn't imbue a great deal of confidence that it will ever finish/self-correct.