Long-running finalize migrations causing unacceptable upgrade downtimes for large self-managed deployments - EnsureAgainBackfillForCiPipelineVariablesPipelineIdIsFinished, EnsureBackfillBigintIdIsCompleted
Problem to solve
There have been two cases recently where customers performing upgrades have experienced db migrations running for much longer than expected:
- While upgrading from 16.0.8 to 16.3.3, the instance was down for 4 hours while the 20230701053315 EnsureAgainBackfillForCiPipelineVariablesPipelineIdIsFinished migration processed 54 million ci_pipeline_variables rows (ZD internal link).
- While upgrading from 15.11.8 to 16.3, the instance was down for 4 hours while the 20230517005523 EnsureBackfillBigintIdIsCompleted migration ran.
The Upgrade Docs call out selected batched background migrations that may take multiple days to complete on large systems, for example Long-running user type data change and BackfillPreparedAtMergeRequests. They warn customers to ensure these migrations have completed before upgrading to the version containing the migration that runs ensure_batched_background_migration_is_finished to wait for any outstanding batched jobs to be processed. This effectively means treating the versions containing the batched background migrations as required stops on the upgrade path.
However, neither of the above migrations was mentioned in the Upgrade Docs as potentially running for an extended period when upgrading across multiple versions. The customers were therefore unaware that upgrading across the version containing the batched background migration would incur prolonged downtime.
The question this issue poses is how best to reconcile the benefits of being able to upgrade (with downtime) across multiple minor versions with the requirement to not have unpredictable and indefinite outages.
Further details
EnsureAgainBackfillForCiPipelineVariablesPipelineIdIsFinished migration details:
- 20230609065942_backfill_ci_pipeline_variables_for_pipeline_id_bigint_conversion - batched background migration to populate the new column with existing values from the original column - applied in version 16.2
- 20230701053315_ensure_again_backfill_for_ci_pipeline_variables_pipeline_id_is_finished - post migration to ensure any pending batched background migration jobs are allowed to complete before applying subsequent background migrations to create required index on new column, in preparation for column renaming - applied in version 16.3
EnsureBackfillBigintIdIsCompleted migration details:
- 20230427065942_backfill_ci_pipeline_variables_for_bigint_conversion - batched background migration to populate the new bigint id column with values from the int column - applied in version 16.0
- 20230517005523_ensure_backfill_bigint_id_is_completed - post migration to ensure any pending batched background migration jobs are allowed to complete before applying subsequent background migrations to create required index on new column, in preparation for column renaming - applied in version 16.1
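The version pairs above can be used to check whether an upgrade path jumps over both a backfill and its finalize migration in one step. The following is a minimal sketch only: the `BACKFILLS` table, the `minor` helper and `blocking_finalizers` are hypothetical names for illustration and not part of any GitLab tooling. It also assumes that a backfill enqueued in the minor version the instance is already running has had time to finish.

```python
from typing import NamedTuple


class Backfill(NamedTuple):
    finalizer: str                 # post migration that waits for the backfill
    enqueued_in: tuple[int, int]   # (major, minor) that enqueues the backfill
    finalized_in: tuple[int, int]  # (major, minor) that runs the finalizer


# Hypothetical table built from the two migrations described above.
BACKFILLS = [
    Backfill("EnsureBackfillBigintIdIsCompleted", (16, 0), (16, 1)),
    Backfill(
        "EnsureAgainBackfillForCiPipelineVariablesPipelineIdIsFinished",
        (16, 2),
        (16, 3),
    ),
]


def minor(version: str) -> tuple[int, int]:
    """Reduce '16.3.3' or '16.3' to (16, 3)."""
    major, minor_, *_ = version.split(".")
    return int(major), int(minor_)


def blocking_finalizers(current: str, target: str) -> list[str]:
    """Finalizers whose backfill is enqueued and finalized in the same upgrade.

    Assumption: a backfill enqueued in the minor version already running
    has completed in the background, so its finalizer will not block.
    """
    start, end = minor(current), minor(target)
    return [
        b.finalizer
        for b in BACKFILLS
        if start < b.enqueued_in and b.finalized_in <= end
    ]


print(blocking_finalizers("16.0.8", "16.3.3"))
print(blocking_finalizers("15.11.8", "16.3"))
```

Under these assumptions, the 16.0.8 to 16.3.3 path flags only the pipeline_id finalizer, while the 15.11.8 to 16.3 path flags both, which is consistent with the two customer cases reported above.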
Proposal
Three options present themselves to help customers avoid this situation:
- Provide a rows-per-minute estimate for every background migration when it is released, which customers can compare to their record counts to ascertain whether they should include that version on their upgrade path; or
- Recommend all customers with selected database tables containing over a certain number of rows only upgrade one minor version at a time (as required by the zero-downtime upgrade process)
- Retrospectively update the Upgrade Docs with migration-specific advisories after an issue with a particular migration is reported by a customer
Option 1 is best for customers as it gives them the information they need to estimate for themselves how long an outage window a particular upgrade is likely to require based on their upgrade path, and allows them to choose whether or not to add extra stops on that path.
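As a rough illustration of how a customer might use an Option 1 estimate, the sketch below divides a table's row count by a published rows-per-minute rate. The function name and the 20 million row instance are hypothetical. The rate is simply the one implied by the first case above (54 million rows in about 4 hours) and is not a measured benchmark, since real rates will vary with hardware and load.

```python
import math


def estimate_downtime_minutes(row_count: int, rows_per_minute: float) -> int:
    """Estimate how long a finalize migration would block, rounded up to whole minutes."""
    if rows_per_minute <= 0:
        raise ValueError("rows_per_minute must be positive")
    return math.ceil(row_count / rows_per_minute)


# Rate implied by the first case above: 54 million rows in roughly 4 hours.
observed_rate = 54_000_000 / (4 * 60)  # 225000.0 rows per minute

# Hypothetical instance with 20 million ci_pipeline_variables rows.
print(estimate_downtime_minutes(20_000_000, observed_rate))  # prints 89
```

A customer could then compare the result against their maintenance window and decide whether to add the intermediate version as an extra stop.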
Who can address the issue
The Database group can advise on the viability of estimating migration processing rates for Option 1.
The Distribution::Deploy group can advise on the pros/cons of changing the upgrade path advice for Option 2.
I'm not sure who should make the final call on which Option to apply.