
Staging sidekiq best-effort cluster flattened by never-ending background migrations

Since 2020-01-14, the 3 besteffort sidekiq nodes have been pegged at 100% CPU. The cause appears to be the background_migration queue (BackgroundMigrationWorker), which is processing (or attempting to process) a very large number of ActivatePrometheusServicesForSharedClusterApplications jobs.

sidekiq_queue_size{name="background_migration", fqdn=~"redis-sidekiq-01-db-gstg.*"} from thanos (to get the long view):

(image: graph of the query above)

This may be preventing other besteffort jobs from being processed effectively, and is almost definitely not something that we want to go on forever. It's not yet 100% clear whether there are simply a lot of jobs to process and it's taking a long time, but the 2-week form of that graph doesn't inspire a great deal of confidence that it will ever finish or self-correct.
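
One way to tell "slow but draining" from "never finishing" is to fit a trend line to the queue size over time. The following is a rough sketch only, not something that was run against staging: it assumes the `sidekiq_queue_size` series has already been exported as `(unix_timestamp, queue_size)` pairs, and the function names and sample values are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, queue size)


def drain_rate(samples: Sequence[Sample]) -> float:
    """Least-squares slope of queue size over time, in jobs per second.

    Negative means the queue is shrinking; zero or positive means it is
    flat or growing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_q = sum(q for _, q in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("timestamps must not all be equal")
    num = sum((t - mean_t) * (q - mean_q) for t, q in samples)
    return num / denom


def seconds_until_empty(samples: Sequence[Sample]) -> Optional[float]:
    """Naive linear estimate of time until the queue is empty.

    Returns None when the trend is flat or growing, i.e. the queue is
    not expected to drain on its own at the current rate.
    """
    slope = drain_rate(samples)
    if slope >= 0:
        return None
    latest_size = samples[-1][1]
    return latest_size / -slope


if __name__ == "__main__":
    # Hypothetical samples, one per minute; not real staging data.
    example = [(0.0, 1000.0), (60.0, 940.0), (120.0, 880.0)]
    print(drain_rate(example))
    print(seconds_until_empty(example))
```

A `None` result over a window of several days would support the reading that the jobs are being re-enqueued or retried rather than worked through. A linear fit is a crude model, so any finite estimate is only an order-of-magnitude guide.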