17.5 to 17.8 upgrade: BBM fails with insert or update on table "group_type_ci_runner_machines_687967fa8a" violates foreign key constraint "fk_rails_3f92913d27"

Summary

A customer's self-managed environment has a batched background migration (BBM) in a failed state after upgrading from 17.5.5 to 17.8.2, which is our upgrade path for upgrades with downtime.

We found this error in the logs:

2025-02-17_12:02:04.45383 ERROR:  insert or update on table "group_type_ci_runner_machines_687967fa8a" violates foreign key constraint "fk_rails_3f92913d27"
2025-02-17_12:02:04.45385 DETAIL:  Key (runner_id, runner_type)=(1234, 2) is not present in table "group_type_ci_runners_e59bb2812d".
2025-02-17_12:02:04.45385 STATEMENT:  /*application:sidekiq,correlation_id:05ca7788742fcf6ebf8f6b74ea85ed67,jid:cdf08799680ab356b1d18057,endpoint_id:Database::BatchedBackgroundMigration::MainExecutionWorker,db_config_database:gitlabhq_production,db_config_name:main*/ INSERT INTO ci_runner_machines_687967fa8a (id, runner_id, executor_type, created_at, updated_at, contacted_at, version, revision, platform, architecture, ip_address, config, system_xid, creation_state, runner_type, sharding_key_id)
2025-02-17_12:02:04.45386       SELECT id, runner_id, executor_type, created_at, updated_at, contacted_at, version, revision, platform, architecture, ip_address, config, system_xid, creation_state, runner_type, sharding_key_id FROM "ci_runner_machines" WHERE "ci_runner_machines"."id" BETWEEN 1 AND 5375 AND "ci_runner_machines"."id" >= 1 AND ("ci_runner_machines"."runner_type" = 1 OR "ci_runner_machines"."sharding_key_id" IS NOT NULL)
2025-02-17_12:02:04.45389       FOR UPDATE
2025-02-17_12:02:04.45389       ON CONFLICT (id, runner_type) DO NOTHING
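
For triage, the missing key can be pulled out of the `DETAIL` line programmatically. The following is a minimal sketch, not GitLab code: `parse_fk_detail` and `DETAIL_PATTERN` are hypothetical names introduced here, and the pattern only assumes the PostgreSQL message shape shown in the log above.

```ruby
# Hypothetical helper (not part of GitLab): extracts the referenced table and
# the missing key from a PostgreSQL foreign key violation DETAIL line.
DETAIL_PATTERN = /Key \((?<columns>[^)]+)\)=\((?<values>[^)]+)\) is not present in table "(?<table>[^"]+)"/

def parse_fk_detail(line)
  match = DETAIL_PATTERN.match(line)
  return nil unless match

  columns = match[:columns].split(',').map(&:strip)
  values = match[:values].split(',').map(&:strip)

  # Pair each column name with its value, e.g. "runner_id" => "1234".
  { table: match[:table], key: columns.zip(values).to_h }
end

# DETAIL line copied from the log above.
detail = '2025-02-17_12:02:04.45385 DETAIL:  Key (runner_id, runner_type)=(1234, 2) ' \
         'is not present in table "group_type_ci_runners_e59bb2812d".'

result = parse_fk_detail(detail)
puts "missing from #{result[:table]}: #{result[:key]}"
```

Here the extracted key tells us that the runner with `runner_id` 1234 and `runner_type` 2 has not been copied into the `group_type_ci_runners_e59bb2812d` partition yet.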

Related:

The constraint was removed (!171308 (merged)) and re-introduced in 17.8 (!171848 (merged)). Is this too early, given that our published upgrade stop is 17.8?

Steps to reproduce

Example Project

What is the current bug behavior?

Batched migration fails with constraint error

What is the expected correct behavior?

Relevant logs and/or screenshots

Reconstruction of history of problematic migrations

Migrations present as of %17.9

| Number | Migration | Initial milestone | Comments | Finalized |
|--------|-----------|-------------------|----------|-----------|
| 1 | db/post_migrate/20241023144448_queue_backfill_partition_ci_runners.rb | %17.6 | %17.8: no-oped | %17.9 - 20250113151324 |
| 2 | db/post_migrate/20241107064635_queue_backfill_ci_runner_machines_partitioned_table.rb | %17.7 | | %17.9 - 20250113163026 |
| 3 | db/post_migrate/20241211072300_retry_add_fk_from_partitioned_ci_runner_managers_to_partitioned_ci_runners.rb | %17.8 | | |
| 4 | db/post_migrate/20241219100359_queue_copy_runner_taggings.rb | %17.8 | %17.10: re-ordered to 20250103092422_queue_copy_runner_taggings.rb in !184341 (merged) | %17.9 - 20250114135714 |
| 5 | db/post_migrate/20250103092422_requeue_backfill_ci_runners_partitioned_table.rb | %17.8 | %17.10: re-ordered to 20241219100359_requeue_backfill_ci_runners_partitioned_table.rb in !184341 (merged) | %17.9 - 20250113151324 |
| 6 | db/post_migrate/20250113151324_finalize_backfill_ci_runners_partitioned_table.rb | %17.9 | | |
| 7 | db/post_migrate/20250113163026_finalize_backfill_ci_runner_machines_partitioned_table.rb | %17.9 | | |

The ci_runners_e59bb2812d table was then renamed to ci_runners in 20250305070000_replace_ci_runners_with_partitioned_table2 (%17.10).

Migrations present as of %17.10 (after re-sequencing fix in !184341 (merged))

| Number | Migration | Initial milestone | Comments | Finalized |
|--------|-----------|-------------------|----------|-----------|
| 1 | db/post_migrate/20241023144448_queue_backfill_partition_ci_runners.rb | %17.6 | %17.8: no-oped | %17.9 - 20250113151324 |
| 2 | db/post_migrate/20241107064635_queue_backfill_ci_runner_machines_partitioned_table.rb | %17.7 | | %17.9 - 20250113163026 |
| 3 | db/post_migrate/20241211072300_retry_add_fk_from_partitioned_ci_runner_managers_to_partitioned_ci_runners.rb | %17.8 | | |
| 4 | db/post_migrate/20241219100359_requeue_backfill_ci_runners_partitioned_table.rb | %17.8 | %17.10: re-ordered from 20250103092422_requeue_backfill_ci_runners_partitioned_table.rb in !184341 (merged) | %17.9 - 20250113151324 |
| 5 | db/post_migrate/20250103092422_queue_copy_runner_taggings.rb | %17.8 | %17.10: re-ordered from 20241219100359_queue_copy_runner_taggings.rb in !184341 (merged) | %17.9 - 20250114135714 |
| 6 | db/post_migrate/20250113151324_finalize_backfill_ci_runners_partitioned_table.rb | %17.9 | | |
| 7 | db/post_migrate/20250113163026_finalize_backfill_ci_runner_machines_partitioned_table.rb | %17.9 | | |

Users who haven't yet migrated will hit a problem when migrating on the %17.11 codebase: the db/post_migrate/20241107064635_queue_backfill_ci_runner_machines_partitioned_table.rb migration (number 2 in the table above) will execute before the db/post_migrate/20241219100359_requeue_backfill_ci_runners_partitioned_table.rb migration (number 4). This causes the failure we're seeing here, where records copied into one of the ci_runner_machines_687967fa8a partitions reference runners in a ci_runners_e59bb2812d partition that is still empty.
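
The ordering can be checked with a short sketch. It assumes only that pending migrations run in ascending order of their timestamp prefix, as Rails does; `version_of` and `execution_order` are hypothetical helpers written for this illustration, and the filenames are copied from the %17.10 table above.

```ruby
# Filenames from the %17.10 table, deliberately listed out of order.
MIGRATIONS = %w[
  20250103092422_queue_copy_runner_taggings.rb
  20241107064635_queue_backfill_ci_runner_machines_partitioned_table.rb
  20241219100359_requeue_backfill_ci_runners_partitioned_table.rb
  20241211072300_retry_add_fk_from_partitioned_ci_runner_managers_to_partitioned_ci_runners.rb
].freeze

# Hypothetical helper: the 14-digit timestamp prefix is the migration version.
def version_of(filename)
  filename[/\A\d{14}/].to_i
end

# Hypothetical helper: pending migrations run in ascending version order.
def execution_order(filenames)
  filenames.sort_by { |name| version_of(name) }
end

ordered = execution_order(MIGRATIONS)
ordered.each_with_index { |name, position| puts "#{position + 1}. #{name}" }

machines_index = ordered.index { |name| name.include?('queue_backfill_ci_runner_machines') }
runners_index = ordered.index { |name| name.include?('requeue_backfill_ci_runners_partitioned') }

if machines_index < runners_index
  puts 'ci_runner_machines backfill is queued before the ci_runners backfill is requeued'
end
```

With these filenames, the ci_runner_machines backfill and the foreign key migration both sort ahead of the requeued ci_runners backfill, which matches the failure described above.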

Possible fixes

Edited by Pedro Pombeiro