Batched background migrations stuck when upgrading Helm deployment with multiple sidekiq pods
Summary
When upgrading a Helm chart deployment from 15.4.6 to 15.7.2, batched background migrations were left stuck in the Active state due to an orphaned deduplication key.
This occurred because one of the three version 15.4.6 sidekiq pods started running the Database::BatchedBackgroundMigrationWorker before being terminated and replaced with a version 15.7.2 sidekiq pod.
Before being terminated, the 15.4.6 sidekiq pod also logged an error because the code for the new migrations was not available to it:
gitlab-sidekiq-all-in-1-v2-7f89cd769f-x4kkt sidekiq 2023-01-16T08:26:11.739Z pid=20 tid=7j40 WARN: NameError: uninitialized constant Gitlab::BackgroundMigration::BackfillNamespaceDetails
gitlab-sidekiq-all-in-1-v2-7f89cd769f-x4kkt sidekiq Did you mean? Gitlab::BackgroundMigration::BackfillNamespaceTraversalIdsRoots
The version 15.7.2 sidekiq pods repeatedly logged Database::BatchedBackgroundMigrationWorker JID-fbe00631a03e007de8a39693: deduplicated: dropped until executing messages, and no batched background migration jobs were actually run.
Applying the How to clear a GitLab Sidekiq job deduplication idempotency key snippet against the Database::BatchedBackgroundMigrationWorker worker class caused the stuck jobs to start being processed.
The customer reported that scaling the number of sidekiq pods down to zero before upgrading, and then scaling them back up afterwards, avoided the issue.
An Omnibus upgrade ensures sidekiq is stopped and updated before running migrations, so any post/batched migrations are run by the newly updated and restarted sidekiq.
I'm not sure what, if anything, can be done to implement the same sequence in a Helm chart deployment. Do we instead need to add a step to the upgrade documentation to scale the sidekiq pods down to zero before upgrading the chart, and then scale them back up again once the new migrations have been deployed? Or perhaps disable the relevant sidekiq queue/cronjob in the UI before upgrading?
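The failure mode described above can be illustrated with a small, self-contained simulation of the "dropped until executing" deduplication strategy. This is not GitLab's actual implementation (the real logic lives in Gitlab::SidekiqMiddleware::DuplicateJobs and uses Redis); FakeDedup and its in-memory store are purely illustrative, but they show why a key left behind by a crashed worker blocks every later enqueue:

```ruby
require 'digest'

# Minimal sketch (NOT GitLab's real code) of the "until executing"
# deduplication strategy: the first enqueue of a job takes a lock keyed
# on (worker class, args); later enqueues are dropped until the running
# job removes the key when it starts executing.
class FakeDedup
  def initialize
    @store = {} # stands in for Redis
  end

  def key_for(worker_class, args)
    "resque:gitlab:duplicate:#{worker_class}:" \
      "#{Digest::SHA256.hexdigest([worker_class, args].inspect)}"
  end

  # Returns :scheduled if the job took the lock, :dropped if deduplicated.
  def enqueue(worker_class, args = [])
    key = key_for(worker_class, args)
    return :dropped if @store.key?(key)

    @store[key] = true
    :scheduled
  end

  # Normally called when the job starts executing. If the worker crashes
  # with retries disabled (the NameError above) and the pod is terminated,
  # this never runs and the key is orphaned.
  def release(worker_class, args = [])
    @store.delete(key_for(worker_class, args))
  end
end

dedup = FakeDedup.new
# Old 15.4.6 pod enqueues the cron job and takes the lock...
dedup.enqueue('Database::BatchedBackgroundMigrationWorker') # => :scheduled
# ...then crashes before releasing it, so every enqueue from the new
# 15.7.2 pods is dropped:
dedup.enqueue('Database::BatchedBackgroundMigrationWorker') # => :dropped
# Deleting the orphaned key (what dj.delete! does below) unblocks the job:
dedup.release('Database::BatchedBackgroundMigrationWorker')
dedup.enqueue('Database::BatchedBackgroundMigrationWorker') # => :scheduled
```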
Steps to reproduce
Install GitLab version 15.4.6 into a Kubernetes cluster with 3 sidekiq pods configured in values.yaml:
gitlab:
  sidekiq:
    minReplicas: 3
    maxReplicas: 3
Then upgrade to version 15.7.2 via helm upgrade --install -f gitlab.yaml -n gitlab gitlab gitlab/gitlab --version 6.7.2
Once upgraded, go to Admin Area > Monitoring > Background Migrations and observe several Active migrations that never progress.
Then, from the rails console, run the following:
worker_class = Database::BatchedBackgroundMigrationWorker
dj = Gitlab::SidekiqMiddleware::DuplicateJobs::DuplicateJob.new({ 'class' => worker_class.name, 'args' => [] }, worker_class.queue)
dj.delete!
The stuck jobs will then start being processed.
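The reason the leftover key matches jobs enqueued by the upgraded pods is that the deduplication key is derived only from the queue, worker class, and arguments, and the cron worker always runs with empty arguments. The exact hashing recipe is GitLab-internal (in Gitlab::SidekiqMiddleware::DuplicateJobs::DuplicateJob); the dedup_key function below is only an illustrative stand-in showing that such a key is identical before and after the upgrade:

```ruby
require 'digest'

# Illustrative only: the real derivation is GitLab-internal. The point is
# that the key depends solely on (queue, class, args), not on the GitLab
# version, so a key written by a 15.4.6 pod still matches jobs enqueued
# by 15.7.2 pods after a rolling upgrade.
def dedup_key(queue, worker_class, args)
  digest = Digest::SHA256.hexdigest("#{worker_class}:#{args.inspect}")
  "resque:gitlab:duplicate:#{queue}:#{digest}"
end

old_pod_key = dedup_key('cronjob:database_batched_background_migration',
                        'Database::BatchedBackgroundMigrationWorker', [])
new_pod_key = dedup_key('cronjob:database_batched_background_migration',
                        'Database::BatchedBackgroundMigrationWorker', [])
old_pod_key == new_pod_key # => true
```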
What is the current bug behavior?
Batched background migrations are stuck.
What is the expected correct behavior?
Batched background migrations run to completion.
Relevant logs and/or screenshots
Stuck migrations after upgrade complete:
Error from version 15.4.6 sidekiq pod during upgrade:
gitlab-migrations-2-ng97z migrations main: == 20220921093355 ScheduleBackfillNamespaceDetails: migrating =================
gitlab-migrations-2-ng97z migrations main: == 20220921093355 ScheduleBackfillNamespaceDetails: migrated (0.0474s) ========
gitlab-migrations-2-ng97z migrations
...
gitlab-sidekiq-all-in-1-v2-7f89cd769f-x4kkt sidekiq 2023-01-16T08:24:19.471Z pid=20 tid=7hew class=Database::BatchedBackgroundMigrationWorker jid=242989f689cd9d9179f4e6f4 elapsed=0.053 INFO: fail
gitlab-sidekiq-all-in-1-v2-7f89cd769f-x4kkt sidekiq 2023-01-16T08:24:19.473Z pid=20 tid=7hew WARN: {"context":"Job raised exception","job":{"retry":false,"queue":"cronjob:database_batched_background_migration","version":0,"queue_namespace":"cronjob","args":[],"class":"Database::BatchedBackgroundMigrationWorker","jid":"242989f689cd9d9179f4e6f4","created_at":1673857459.4161348,"meta.caller_id":"Cronjob","correlation_id":"4c8d1ff65c9aa401516fcc346f1cb7bd","meta.root_caller_id":"Cronjob","meta.feature_category":"database","worker_data_consistency":"always","idempotency_key":"resque:gitlab:duplicate:cronjob:database_batched_background_migration:592d9619e1997b640b70ce6a22f6713bc7793bb7a4e342b7380d90b691fcd6ae","enqueued_at":1673857459.4176576,"load_balancing_strategy":"primary","instrumentation":{"redis_calls":6,"redis_duration_s":0.023674,"redis_read_bytes":197,"redis_write_bytes":666,"redis_cache_calls":1,"redis_cache_duration_s":0.000324,"redis_cache_read_bytes":186,"redis_cache_write_bytes":35,"redis_queues_calls":3,"redis_queues_duration_s":0.018459,"redis_queues_read_bytes":9,"redis_queues_write_bytes":325,"redis_shared_state_calls":2,"redis_shared_state_duration_s":0.004891,"redis_shared_state_read_bytes":2,"redis_shared_state_write_bytes":306,"db_count":3,"db_write_count":0,"db_cached_count":0,"db_replica_count":0,"db_primary_count":3,"db_main_count":3,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":0,"db_main_cached_count":0,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.007,"db_main_duration_s":0.007,"db_main_replica_duration_s":0.0,"cpu_s":0.011116,"mem_objects":4080,"mem_bytes":255368,"mem_mallocs":641,"mem_total_bytes":418568,"pid":20,"worker_id":"sidekiq_0","rate_limiting_gates":[]}},"jobstr":"{\"retry\":false,\"queue\":\"cronjob:database_batched_background_migration\",\"version\":0,\"queue_namespace\":\"cronjob\",\"args\":[],\"class\":\"Database::BatchedBackgroundMigrationWorker\",\"jid\":\"242989f689cd9d9179f4e6f4\",\"created_at\":1673857459.4161348,\"meta.caller_id\":\"Cronjob\",\"correlation_id\":\"4c8d1ff65c9aa401516fcc346f1cb7bd\",\"meta.root_caller_id\":\"Cronjob\",\"meta.feature_category\":\"database\",\"worker_data_consistency\":\"always\",\"idempotency_key\":\"resque:gitlab:duplicate:cronjob:database_batched_background_migration:592d9619e1997b640b70ce6a22f6713bc7793bb7a4e342b7380d90b691fcd6ae\",\"enqueued_at\":1673857459.4176576}"}
gitlab-sidekiq-all-in-1-v2-7f89cd769f-x4kkt sidekiq 2023-01-16T08:24:19.547Z pid=20 tid=7hew WARN: NameError: uninitialized constant Gitlab::BackgroundMigration::BackfillNamespaceDetails
gitlab-sidekiq-all-in-1-v2-7f89cd769f-x4kkt sidekiq Did you mean? Gitlab::BackgroundMigration::BackfillNamespaceTraversalIdsRoots
gitlab-sidekiq-all-in-1-v2-7f89cd769f-x4kkt sidekiq 2023-01-16T08:24:19.547Z pid=20 tid=7hew WARN: /srv/gitlab/vendor/bundle/ruby/2.7.0/gems/activesupport-6.1.6.1/lib/active_support/inflector/methods.rb:288:in `const_get'
gitlab-sidekiq-all-in-1-v2-7f89cd769f-x4kkt sidekiq /srv/gitlab/vendor/bundle/ruby/2.7.0/gems/activesupport-6.1.6.1/lib/active_support/inflector/methods.rb:288:in `block in constantize'
Message logged by version 15.7.2 pod after upgrade:
gitlab-sidekiq-all-in-1-v2-6944f489fb-6llwj sidekiq 2023-01-16T08:50:02.946Z pid=25 tid=7fxh INFO: {"retry"=>false, "queue"=>"cronjob:database_batched_background_migration", "version"=>0, "queue_namespace"=>:cronjob, "args"=>[], "class"=>"Database::BatchedBackgroundMigrationWorker", "jid"=>"fbe00631a03e007de8a39693", "created_at"=>1673859002.9396782, "meta.caller_id"=>"Cronjob", "correlation_id"=>"288991f675155cb3f0d1f7d5374f4a85", "meta.root_caller_id"=>"Cronjob", "meta.feature_category"=>"database", "worker_data_consistency"=>:always, "idempotency_key"=>"resque:gitlab:duplicate:cronjob:database_batched_background_migration:592d9619e1997b640b70ce6a22f6713bc7793bb7a4e342b7380d90b691fcd6ae", "duplicate-of"=>"dc6fcf389af4a7ebada85845", "job_size_bytes"=>2, "pid"=>25, "job_status"=>"deduplicated", "message"=>"Database::BatchedBackgroundMigrationWorker JID-fbe00631a03e007de8a39693: deduplicated: dropped until executing", "deduplication.type"=>"dropped until executing"}
Output from running rails console commands to remove duplicate job:
irb(main):013:0> worker_class=Database::BatchedBackgroundMigrationWorker
=> Database::BatchedBackgroundMigrationWorker
irb(main):014:0> dj = Gitlab::SidekiqMiddleware::DuplicateJobs::DuplicateJob.new({ 'class' => worker_class.name, 'args' => [] }, worker_class.queue)
=> #<Gitlab::SidekiqMiddleware::DuplicateJobs::DuplicateJob:0x00007f29fe457340 @job={"class"=>"Database::BatchedBackgroundMigrationWorker", "args"=>[]}, @queue_name="cronjob:database_batched_background_migr...
irb(main):015:0> pp dj
#<Gitlab::SidekiqMiddleware::DuplicateJobs::DuplicateJob:0x00007f29fe457340
@job={"class"=>"Database::BatchedBackgroundMigrationWorker", "args"=>[]},
@queue_name="cronjob:database_batched_background_migration">
=> #<Gitlab::SidekiqMiddleware::DuplicateJobs::DuplicateJob:0x00007f29fe457340 @job={"class"=>"Database::BatchedBackgroundMigrationWorker", "args"=>[]}, @queue_name="cronjob:database_batched_background_migration">
irb(main):016:0> dj.delete!
=> 1
Possible fixes
- Document the requirement to scale running sidekiq pods down to zero before upgrading the chart, and then scale them back up once the upgrade is complete.
- Document the requirement to disable the sidekiq batched background migration queue/cronjob before upgrading.
- Stop "old" sidekiq pods from processing jobs until migrations have been applied and the sidekiq pods have been cycled.
- Clear out orphaned deduplication key(s) as part of the upgrade process.
