17.5 to 17.8 upgrade: BBM fails with insert or update on table "group_type_ci_runner_machines_687967fa8a" violates foreign key constraint "fk_rails_3f92913d27"
Summary
A customer's self-managed environment has a batched background migration in a failed state after upgrading from 17.5.5 to 17.8.2, which follows our upgrade path with downtime.
We found this error in the logs:
2025-02-17_12:02:04.45383 ERROR: insert or update on table "group_type_ci_runner_machines_687967fa8a" violates foreign key constraint "fk_rails_3f92913d27"
2025-02-17_12:02:04.45385 DETAIL: Key (runner_id, runner_type)=(1234, 2) is not present in table "group_type_ci_runners_e59bb2812d".
2025-02-17_12:02:04.45385 STATEMENT: /*application:sidekiq,correlation_id:05ca7788742fcf6ebf8f6b74ea85ed67,jid:cdf08799680ab356b1d18057,endpoint_id:Database::BatchedBackgroundMigration::MainExecutionWorker,db_config_database:gitlabhq_production,db_config_name:main*/ INSERT INTO ci_runner_machines_687967fa8a (id, runner_id, executor_type, created_at, updated_at, contacted_at, version, revision, platform, architecture, ip_address, config, system_xid, creation_state, runner_type, sharding_key_id)
2025-02-17_12:02:04.45386 SELECT id, runner_id, executor_type, created_at, updated_at, contacted_at, version, revision, platform, architecture, ip_address, config, system_xid, creation_state, runner_type, sharding_key_id FROM "ci_runner_machines" WHERE "ci_runner_machines"."id" BETWEEN 1 AND 5375 AND "ci_runner_machines"."id" >= 1 AND ("ci_runner_machines"."runner_type" = 1 OR "ci_runner_machines"."sharding_key_id" IS NOT NULL)
2025-02-17_12:02:04.45389 FOR UPDATE
2025-02-17_12:02:04.45389 ON CONFLICT (id, runner_type) DO NOTHING
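The failure mode in the log above can be reproduced in miniature. The sketch below is not GitLab code: it uses an in-memory SQLite database, and the table names (`runners`, `machines`, `runners_copy`, `machines_copy`) and the helper names are hypothetical stand-ins for the real tables and partitions. It only illustrates that an `INSERT ... SELECT` backfill into a table with a composite foreign key on `(runner_id, runner_type)` fails while the referenced table is still empty, and succeeds once the referenced rows have been copied first.

```python
import sqlite3


def backfill_machines(conn):
    """Copy machine rows into the FK-constrained copy (toy stand-in for the backfill)."""
    conn.execute(
        "INSERT INTO machines_copy (id, runner_id, runner_type) "
        "SELECT id, runner_id, runner_type FROM machines"
    )


def reproduce_fk_violation():
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE runners (id INTEGER, runner_type INTEGER);
        CREATE TABLE machines (id INTEGER, runner_id INTEGER, runner_type INTEGER);
        CREATE TABLE runners_copy (
            id INTEGER, runner_type INTEGER,
            PRIMARY KEY (id, runner_type)
        );
        CREATE TABLE machines_copy (
            id INTEGER, runner_id INTEGER, runner_type INTEGER,
            PRIMARY KEY (id, runner_type),
            FOREIGN KEY (runner_id, runner_type)
                REFERENCES runners_copy (id, runner_type)
        );
        INSERT INTO runners VALUES (1234, 2);
        INSERT INTO machines VALUES (1, 1234, 2);
        """
    )

    # First attempt: the referenced copy is still empty, so the backfill fails.
    try:
        backfill_machines(conn)
        first_attempt = "ok"
    except sqlite3.IntegrityError as exc:
        first_attempt = f"failed: {exc}"
    conn.rollback()

    # Copy the referenced rows first, then retry the same backfill.
    conn.execute(
        "INSERT INTO runners_copy (id, runner_type) "
        "SELECT id, runner_type FROM runners"
    )
    backfill_machines(conn)
    copied = conn.execute("SELECT COUNT(*) FROM machines_copy").fetchone()[0]
    conn.commit()
    return first_attempt, copied


if __name__ == "__main__":
    print(reproduce_fk_violation())
```

The point of the sketch is only the ordering dependency: the same statement that fails against an empty parent table succeeds unchanged once the parent rows exist.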
Related:
- Ticket (internal link): https://gitlab.zendesk.com/agent/tickets/605599
- Finalize BackfillCiRunnerMachinesPartitionedTable (!177745 - merged)
- 2024-10-31: Runner verification API returning 500 (gitlab-com/gl-infra/production#18792 - closed)
The constraint was removed (!171308 (merged)) and re-introduced in 17.8 (!171848 (merged)). Is this too early, given that our published upgrade stop is 17.8?
What is the current bug behavior?
The batched background migration fails with a foreign key constraint violation.
Reconstruction of history of problematic migrations
Migrations present as of %17.9
| Number | Migration | Initial milestone | Comments | Finalized |
|---|---|---|---|---|
| 1 | db/post_migrate/20241023144448_queue_backfill_partition_ci_runners.rb | %17.6 | %17.8: no-oped | %17.9 - 20250113151324 |
| 2 | db/post_migrate/20241107064635_queue_backfill_ci_runner_machines_partitioned_table.rb | %17.7 | | %17.9 - 20250113163026 |
| 3 | db/post_migrate/20241211072300_retry_add_fk_from_partitioned_ci_runner_managers_to_partitioned_ci_runners.rb | %17.8 | | |
| 4 | db/post_migrate/20241219100359_queue_copy_runner_taggings.rb | %17.8 | %17.10: re-ordered to 20250103092422_queue_copy_runner_taggings.rb in !184341 (merged) | %17.9 - 20250114135714 |
| 5 | db/post_migrate/20250103092422_requeue_backfill_ci_runners_partitioned_table.rb | %17.8 | %17.10: re-ordered to 20241219100359_requeue_backfill_ci_runners_partitioned_table.rb in !184341 (merged) | %17.9 - 20250113151324 |
| 6 | db/post_migrate/20250113151324_finalize_backfill_ci_runners_partitioned_table.rb | %17.9 | | |
| 7 | db/post_migrate/20250113163026_finalize_backfill_ci_runner_machines_partitioned_table.rb | %17.9 | | |
The ci_runners_e59bb2812d table was then renamed to ci_runners in 20250305070000_replace_ci_runners_with_partitioned_table2 (%17.10).
Migrations present as of %17.10 (after the re-sequencing fix in !184341 (merged))
| Number | Migration | Initial milestone | Comments | Finalized |
|---|---|---|---|---|
| 1 | db/post_migrate/20241023144448_queue_backfill_partition_ci_runners.rb | %17.6 | %17.8: no-oped | %17.9 - 20250113151324 |
| 2 | db/post_migrate/20241107064635_queue_backfill_ci_runner_machines_partitioned_table.rb | %17.7 | | %17.9 - 20250113163026 |
| 3 | db/post_migrate/20241211072300_retry_add_fk_from_partitioned_ci_runner_managers_to_partitioned_ci_runners.rb | %17.8 | | |
| 4 | db/post_migrate/20241219100359_requeue_backfill_ci_runners_partitioned_table.rb | %17.8 | %17.10: re-ordered from 20250103092422_requeue_backfill_ci_runners_partitioned_table.rb in !184341 (merged) | %17.9 - 20250113151324 |
| 5 | db/post_migrate/20250103092422_queue_copy_runner_taggings.rb | %17.8 | %17.10: re-ordered from 20241219100359_queue_copy_runner_taggings.rb in !184341 (merged) | %17.9 - 20250114135714 |
| 6 | db/post_migrate/20250113151324_finalize_backfill_ci_runners_partitioned_table.rb | %17.9 | | |
| 7 | db/post_migrate/20250113163026_finalize_backfill_ci_runner_machines_partitioned_table.rb | %17.9 | | |
Users who haven't yet migrated will still hit this problem when migrating on the %17.11 codebase: the db/post_migrate/20241107064635_queue_backfill_ci_runner_machines_partitioned_table.rb migration (number 2 in the table above) executes before the db/post_migrate/20241219100359_requeue_backfill_ci_runners_partitioned_table.rb migration (number 4). As a result, rows copied into one of the ci_runner_machines_687967fa8a partitions reference runners that are not yet present in the corresponding, still empty, ci_runners_e59bb2812d partition, which is the foreign key violation we're seeing here.
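The ordering argument can be checked mechanically. The following is a minimal sketch, assuming migrations are queued in ascending order of their timestamp prefix (standard Rails behaviour); it does not model how Sidekiq later executes the batched jobs. The filenames are the post-re-sequencing ones from the table above, while the helper names (`execution_order`, `runs_before`) are hypothetical.

```python
# Post-deployment migrations as listed for %17.10 (after !184341), in arbitrary order.
MIGRATIONS = [
    "20250103092422_queue_copy_runner_taggings.rb",
    "20241107064635_queue_backfill_ci_runner_machines_partitioned_table.rb",
    "20241219100359_requeue_backfill_ci_runners_partitioned_table.rb",
    "20241023144448_queue_backfill_partition_ci_runners.rb",
    "20241211072300_retry_add_fk_from_partitioned_ci_runner_managers_to_partitioned_ci_runners.rb",
]


def version(filename):
    """Return the numeric timestamp prefix of a migration filename."""
    return int(filename.split("_", 1)[0])


def execution_order(filenames):
    """Sort migrations by timestamp, the order in which they are assumed to run."""
    return sorted(filenames, key=version)


def runs_before(first_marker, second_marker, filenames):
    """True if the migration containing first_marker sorts before the one containing second_marker."""
    order = execution_order(filenames)
    first = next(i for i, name in enumerate(order) if first_marker in name)
    second = next(i for i, name in enumerate(order) if second_marker in name)
    return first < second


if __name__ == "__main__":
    for name in execution_order(MIGRATIONS):
        print(name)
    # The machines backfill is queued before the runners requeue.
    print(runs_before("queue_backfill_ci_runner_machines",
                      "requeue_backfill_ci_runners",
                      MIGRATIONS))
```

Because migration 1 was no-oped in %17.8, the runners partitions are only populated by the requeue migration (number 4), which sorts after both the machines backfill (number 2) and the foreign key retry (number 3).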

