Batched background migration marked as finished, but there are failed jobs

This is extracted from #341663 (comment 691232113).

A customer reported that they have a batched background migration marked as finished (status 3):

gitlabrds=> SELECT * FROM batched_background_migrations WHERE table_name = 'ci_build_needs'\gx
-[ RECORD 1 ]-----+-----------------------------------------------
id                | 7
created_at        | 2021-08-17 17:05:43.264787+00
updated_at        | 2021-08-18 03:00:05.094282+00
min_value         | 1
max_value         | 25201
batch_size        | 20000
sub_batch_size    | 1000
interval          | 120
status            | 3
job_class_name    | CopyColumnUsingBackgroundMigrationJob
batch_class_name  | PrimaryKeyBatchingStrategy
table_name        | ci_build_needs
column_name       | id
job_arguments     | [["build_id"], ["build_id_convert_to_bigint"]]
total_tuple_count | 22649
pause_ms          | 100

but some jobs are actually marked as failed:

gitlabrds=> SELECT status, COUNT(*) FROM batched_background_migration_jobs WHERE batched_background_migration_id = 7 GROUP BY status;
 status | count
--------+-------
      2 |     1
      3 |     1
(2 rows)
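
To look for other migrations in the same state, the two checks above can be combined into one join. The following is only a minimal sketch: it runs against an in-memory SQLite stand-in that contains just the columns needed and the rows from this report, and it assumes that migration status 3 means finished and job status 2 means failed, as read from the output above rather than from the GitLab source. The helper names `build_sample_db` and `finished_with_failed_jobs` are hypothetical.

```python
import sqlite3

# Assumed status codes, inferred from the report (not taken from the GitLab source).
MIGRATION_FINISHED = 3
JOB_FAILED = 2

# Finished migrations that still have at least one failed job.
QUERY = """
SELECT m.id, COUNT(j.id) AS failed_jobs
FROM batched_background_migrations m
JOIN batched_background_migration_jobs j
  ON j.batched_background_migration_id = m.id
WHERE m.status = ? AND j.status = ?
GROUP BY m.id
ORDER BY m.id
"""


def build_sample_db():
    """Create a tiny stand-in schema holding only the rows from this report."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE batched_background_migrations (
            id INTEGER PRIMARY KEY,
            status INTEGER NOT NULL
        );
        CREATE TABLE batched_background_migration_jobs (
            id INTEGER PRIMARY KEY,
            batched_background_migration_id INTEGER NOT NULL,
            status INTEGER NOT NULL
        );
        INSERT INTO batched_background_migrations (id, status) VALUES (7, 3);
        INSERT INTO batched_background_migration_jobs
            (batched_background_migration_id, status)
        VALUES (7, 2), (7, 3);
        """
    )
    return conn


def finished_with_failed_jobs(conn):
    """Return (migration_id, failed_job_count) for inconsistent migrations."""
    return conn.execute(QUERY, (MIGRATION_FINISHED, JOB_FAILED)).fetchall()


conn = build_sample_db()
rows = finished_with_failed_jobs(conn)
print(rows)
```

The SQL in `QUERY` uses only standard syntax, so it should also be usable from psql once the `?` placeholders are replaced with the literal status values.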

Indeed, a number of rows were not migrated:

gitlabrds=> CREATE INDEX CONCURRENTLY tmp_index_ci_build_needs_not_migrated ON ci_build_needs (build_id_convert_to_bigint) WHERE build_id_convert_to_bigint = 0;
CREATE INDEX
gitlabrds=> SELECT COUNT(*) FROM ci_build_needs WHERE build_id_convert_to_bigint = 0;
 count
-------
  3150
(1 row)

From a quick look at the related code, this should not be possible, but there may be an edge case or a race condition: https://gitlab.com/gitlab-org/gitlab/-/blob/7202bb889de8525d0e395b0dd4eccc42425fca9b/lib/gitlab/database/background_migration/batched_migration_runner.rb#L118-122.