17.1.x conditional upgrade stop for FinalizeBackfillPartitionIdCiPipelineMessage DB migration

This planning issue is to discuss whether 17.1 should be made a conditional required upgrade stop.

Condition that may require an upgrade stop for version 17.1

Summary

The post-deployment FinalizeBackfillPartitionIdCiPipelineMessage migration adds a column called partition_id to the ci_pipeline_messages table and backfills the column for every row, using the values from ci_pipelines.partition_id. This table can become very large, with millions or tens of millions of rows. Depending on the table size and the speed at which the rows are updated, this migration can delay an upgrade by anywhere from 1 to 30+ hours.

Proposal: add a conditional required upgrade stop at 17.1, based on the size of the customer's ci_pipeline_messages table.

Examples

In the last 3 days, we have had two large ARR customers trigger emergencies with GitLab Support during an upgrade from 16.11 to 17.2/17.3. In both cases, the customer believed that the DB migration had stalled on FinalizeBackfillPartitionIdCiPipelineMessage, which is a 17.2 migration. This issue caused them to run over their scheduled maintenance windows.

Customer A

Customer A had been waiting on their DB migrations to finish for over 6 hours when they triggered the emergency. They had almost 63 million rows in the ci_pipeline_messages table, and rows were only getting updated at a rate of ~30,000/minute with the bundled PostgreSQL database. At this rate, that single DB migration would take about 35 hours in total to complete (63,000,000 rows / 30,000 rows per minute = 2,100 minutes).

Customer B

Customer B had been waiting for FinalizeBackfillPartitionIdCiPipelineMessage to complete for 45-50 minutes when they triggered the emergency. They had ~4.5 million rows in ci_pipeline_messages, which were getting updated at a rate of ~45,000/minute with an RDS database. The delay forced them to extend their maintenance window by an hour.
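The estimates in both examples come from the same arithmetic: row count divided by the observed update rate. Below is a minimal sketch of that calculation in Python, assuming the update rate stays roughly constant for the whole backfill (in practice it may vary with load and hardware). The function name `estimate_backfill_hours` is hypothetical and not part of GitLab.

```python
def estimate_backfill_hours(row_count: int, rows_per_minute: float) -> float:
    """Estimate the total backfill time in hours.

    Assumes a constant update rate, which is a simplification.
    """
    if rows_per_minute <= 0:
        raise ValueError("rows_per_minute must be positive")
    return row_count / rows_per_minute / 60


# Figures taken from the two customer examples above.
customer_a = estimate_backfill_hours(63_000_000, 30_000)
customer_b = estimate_backfill_hours(4_500_000, 45_000)

print(f"Customer A: ~{customer_a:.1f} hours")  # Customer A: ~35.0 hours
print(f"Customer B: ~{customer_b:.1f} hours")  # Customer B: ~1.7 hours
```

A customer could compare an estimate like this against their planned maintenance window before deciding whether to stop at 17.1 first.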

Notes for the issue author only

After the issue creation

Slack message template:

The Distribution::Deploy group created an issue (link to this issue) to determine if 17.1 needs to be a required upgrade stop. Please review your upcoming changes and share any that may require an upgrade stop on the issue (link to this issue). Thank you.

  • Update "Next Required Stop" bookmark in #g_distribution to this issue link.
  • Update EWIR.
  • Use the previous Slack message template to post to #engineering-fyi and cross post to:
    • #eng-managers
    • #cto

After the decision is made

If 17.1 is an upgrade stop

Slack message template:

An update on the next upgrade stop (link to this issue): 17.1 is a planned upgrade stop. It is a great opportunity to plan tasks as mentioned on Adding required stops and Avoiding required stops.

  • Comment on this issue.
  • Update EWIR.
  • Use the previous Slack message template to post to #engineering-fyi and cross post to:
    • #eng-managers
    • #cto
    • #whats-happening-at-gitlab
    • #support_self-managed

If 17.1 is not an upgrade stop

Slack message template:

An update on the next upgrade stop (link to this issue): 17.1 is NOT a planned upgrade stop.

  • Comment on this issue.
  • Update EWIR.
  • Use the previous Slack message template to post to #engineering-fyi and cross post to:
    • #eng-managers
    • #cto

CCs

Edited by Ben Prescott (ex-GitLab)