Put in place measures to avoid addition/proliferation of GitLab upgrade path stops
This issue was created with contributions from @anton and @kballon.
Summary
The GitLab upgrade path is getting increasingly complex and arbitrary, creating a poor experience for system administrators responsible for maintaining their GitLab environments and causing unexpected and unplanned down/maintenance time for organisations and their end-users.
We should investigate the causes which introduce these seemingly arbitrary upgrade stops, and explore the measures we can take to reduce or eliminate these from being introduced in future versions.
Overview
In the past, GitLab major upgrades were straightforward, and could be summarised as follows:
- Upgrade to the latest minor version of the preceding major version.
- Upgrade to the "dot zero" release of the next major version (X.0.Z).
This was easy to understand, and upgrade stops in a multi-version upgrade path were predictable.
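The two-step rule above can be sketched in a few lines. This is only an illustration: the names LAST_MINOR and legacy_upgrade_path are hypothetical, and the "last minor" numbers are taken from the upgrade paths listed in this issue.

```python
# Last minor release of each major series, as listed in this issue
# (11.11, 12.10, 13.12, 14.10). Hypothetical lookup table for illustration.
LAST_MINOR = {11: 11, 12: 10, 13: 12, 14: 10}


def legacy_upgrade_path(current_major, target_major):
    """Return the stops implied by the old two-step rule as 'X.Y' strings."""
    path = []
    for major in range(current_major, target_major):
        # Step 1: latest minor version of the preceding major version.
        path.append(f"{major}.{LAST_MINOR[major]}")
        # Step 2: "dot zero" release of the next major version.
        path.append(f"{major + 1}.0")
    return path


print(legacy_upgrade_path(11, 13))
```

Under this rule the number of stops depends only on how many major versions are crossed, which is what made the path predictable.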
For the last three major versions this is no longer the case: multiple additional upgrade stops at seemingly arbitrary minor versions are now mandated, and an upgrade that skips them would be incomplete.
Version 11 to 12
0 additional stops: 11.0.Z -> 11.11.8 (last minor ver.) -> 12.0.12
Note: This is the last major version where we kept to the most straightforward upgrade path.
Version 12 to 13
1 additional stop: 12.0.12 -> 12.1 -> 12.10 (last minor ver.) -> 13.0.14
Details on each stop:
- 12.1 was introduced as a stop in %13.6 because the ResetMergeStatus background migration needs to be able to access the merge_requests.state column in order to complete when triggered from 12.1+. This column was removed in 12.10 via this commit, so if users upgraded directly from 12.0.Z -> 12.10.Z, they would end up with a lot of stuck ResetMergeStatus background jobs (as per this issue) and would be blocked from upgrading to 13.X.
  - Docs MR: !46632 (merged)
  - Backport request (declined due to length of time passed): gitlab-org/release/tasks#1753
Version 13 to 14
2 additional stops: 13.0.14 -> 13.1.11 -> 13.8.8 -> 13.12.15 (last minor ver.) -> 14.0.12
Details on each stop:
- 13.1 includes a Rails version change from 6.0.3 to 6.0.3.1, which looks related to some security vulnerabilities within Rails. The Rails upgrade included a change to CSRF token generation which is not backwards-compatible: GitLab servers with the new Rails version generate CSRF tokens that are not recognizable by GitLab servers with the older Rails version, which could cause non-GET requests to fail for multi-node GitLab installations. It appears this stop was only included for multi-node GitLab installations, and it could technically be skipped for single-node installs.
  - Docs MR: !34116 (merged)
- 13.8 includes a background migration to address an issue with duplicate service records. If duplicate services are present, this background migration must complete before the unique index introduced in GitLab 13.9 is applied to the services table. Upgrades from GitLab 13.8 and earlier to later versions must include an intermediate upgrade to GitLab 13.8.8 before proceeding.
  - Docs MR: !68874 (merged)
Version 14 to 15
2 additional stops: 14.0.12 -> 14.3.6 -> 14.9.5 -> 14.10.Z (last minor ver.) -> 15.0.Z
Details on each stop:
- 14.3 includes a batched background migration, MigrateMergeRequestDiffCommitUsers. This migration might take hours or days to complete on larger GitLab instances. 14.5 foregrounded this migration, which caused a number of large instances to run into unexpected downtime beyond their scheduled maintenance windows. As a result, 14.3.6 was added to the upgrade path, but only ~7 months after its release date.
  - Docs MR: !89621 (merged)
  - Additional context: #375553 (comment 1117399427)
- 14.9 includes a batched background migration, BackfillAllProjectNamespaces, to ensure there is a corresponding record in the namespaces table for each record in the projects table. 14.10 includes a batched background migration, BackfillNamespaceIdForProjectRoutes, that depends on the former migration being completed in full, making 14.9 a required upgrade stop.
  - Note: This migration might take hours or days to complete on larger GitLab instances.
  - Docs MR: !86029 (merged)
Version 15 to 16
1 additional stop so far: 15.0.Z -> 15.4.0 -> TBD
Details on each stop:
- 15.4 includes a batched background migration to remove incorrect values from expire_at in the ci_job_artifacts table. This background migration needs to be complete before some temporary code is removed in 15.6, making it a required upgrade stop.
  - This migration might take hours or days to complete on larger GitLab instances.
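To make the cumulative effect concrete, here is a minimal sketch that derives the stops between two versions from the paths listed above. The names REQUIRED_STOPS, parse_version and upgrade_path are hypothetical and not part of any GitLab tooling. It compares only major.minor and ignores patch levels (such as the 13.8.8 requirement), and the 15.x entries reflect the "so far" state described in this issue.

```python
# Required stops as (major, minor) pairs, copied from the upgrade paths
# listed in this issue. The 15.x series is incomplete ("TBD" above).
REQUIRED_STOPS = [
    (11, 11), (12, 0), (12, 1), (12, 10),
    (13, 0), (13, 1), (13, 8), (13, 12),
    (14, 0), (14, 3), (14, 9), (14, 10),
    (15, 0), (15, 4),
]


def parse_version(version):
    """Reduce 'X.Y' or 'X.Y.Z' to an (X, Y) tuple; the patch level is ignored."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def upgrade_path(current, target):
    """List the intermediate stops strictly between current and target, then target."""
    cur, tgt = parse_version(current), parse_version(target)
    stops = [stop for stop in REQUIRED_STOPS if cur < stop < tgt]
    return [f"{major}.{minor}" for major, minor in stops] + [target]


print(upgrade_path("12.0.12", "14.0.12"))
```

With this table, going from 12.0.12 to 14.0.12 yields six intermediate stops before the target, against three under the older two-step rule (12.10, 13.0, 13.12).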
Review
The required upgrade stops fall into these categories:
- Code changes dependent on previous database schema changes were introduced (12.1, 13.8).
- Code changes dependent on data manipulation/consistency were introduced (14.9, 15.4).
  - Potentially long-running background migration forced to foreground (14.3).
- Security fix (13.1), although it's unclear if this is a strongly required stop.
Potential causes
Description and discussion in comments.
- Unsafe assumptions and blindspots regarding GitLab environment upgrades in the wild
- Dev teams prioritise GitLab feature development velocity above keeping upgrade path simple
- Unintentional coupling of GitLab SaaS (GitLab.com) concerns with the GitLab codebase
Other notes
- Since %15.0 we test all background migrations on our database testing pipelines using thin clones - this will help us identify similar issues early and before the migrations are released
- We have switched to using batched background migrations by default, and we'll soon stop using regular background migrations. That will remove the issues with thousands or tens of thousands of jobs being queued, but will not solve the issues with skipping versions.
- We are now also inlining batched background migrations - that addresses issues with small/medium instances but we have to also continue the discussion on the timeouts.
Brainstorm ways for background migrations to be finalized without introducing a required upgrade step - #357561