Although it is possible to do upgrades without a prior backup, no customer that values their data and/or service availability will perform an upgrade without first backing up their installation.
Backups for large multi-node installations are hard.
You generally want to do a backup just before performing an upgrade. Backups may be automated and performed during low-usage hours, which may not be a convenient time to do an upgrade.
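For Linux package installations, the pre-upgrade backup typically amounts to a data backup plus a copy of the configuration and secrets. A minimal sketch, assuming a single-node Linux package installation writing to the default local backup location:
sudo gitlab-backup create   # back up repositories, database, uploads, etc. (defaults to /var/opt/gitlab/backups)
sudo gitlab-ctl backup-etc   # back up /etc/gitlab (gitlab.rb and gitlab-secrets.json), which the data backup does not include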
General upgrade observations
It is recommended to wait up to a week between upgrades, especially for minor or major version upgrades. These waiting periods can stretch out the overall upgrade when going through multiple upgrade stops.
Re-installs are required if upgrading to a version that is not supported on the current operating system.
Upgrading to a new version may also require an upgrade of either the bundled or an external PostgreSQL, which requires extra steps.
Deprecations may require manual configuration steps before or after an upgrade.
GitLab Runner must be upgraded to the same version.
Single-node Linux package installation observations
PostgreSQL upgrades may need spare disk space equal to double the existing database size (a quick disk-space check is sketched below).
Can upgrade to next upgrade stop.
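One quick way to gauge headroom before a PostgreSQL upgrade is to compare the current database size with the free space on the volume holding the data directory. A minimal sketch, assuming a Linux package installation with the default database name and data directory:
sudo gitlab-psql -c "SELECT pg_size_pretty(pg_database_size('gitlabhq_production'));"   # current database size
df -h /var/opt/gitlab/postgresql   # free space on the PostgreSQL data volume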
Zero-downtime upgrade observations
There is no zero-downtime upgrade for a single node installation.
Have to upgrade every point release.
Cannot do major version PostgreSQL upgrades with zero downtime.
Need to drain Rails nodes before upgrade.
Node upgrade sequence is important.
- Lots of manual steps. At minimum, for each node:
sudo touch /etc/gitlab/skip-auto-reconfigure   # prevent the package upgrade from automatically running reconfigure and migrations
sudo apt-get update && sudo apt-get install gitlab-ee   # upgrade the package (apt shown; use the platform's package manager)
sudo gitlab-ctl reconfigure
sudo gitlab-ctl restart
Rails node upgrades are particularly complex.
Multi-node upgrades with downtime
Can upgrade to next upgrade stop.
Node upgrade sequence is less important.
Helm chart upgrades
Relatively straightforward compared to multi-node Linux package upgrades (a typical invocation is sketched after this list).
Internal PostgreSQL chart upgrades require special handling.
There is no zero-downtime option.
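A typical chart upgrade, as a minimal sketch assuming a release named gitlab installed from the official gitlab/gitlab chart with settings kept in a local values.yaml:
helm repo update   # refresh the chart repository so the target chart version is available
helm upgrade gitlab gitlab/gitlab --version <target-chart-version> -f values.yaml   # upgrade the release, reusing the existing values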
Operator upgrades
Very straightforward: just update the version in the GitLab CR (see the sketch after this list).
No zero-downtime option.
No instructions for PostgreSQL upgrade.
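As a rough illustration only, assuming the GitLab custom resource is named gitlab in the gitlab-system namespace and exposes the chart version under spec.chart.version (check the Operator documentation for the exact field name), the version bump could look like:
# patch the chart version in the GitLab CR and let the Operator reconcile the change
kubectl patch gitlab gitlab -n gitlab-system --type merge -p '{"spec":{"chart":{"version":"<target-chart-version>"}}}'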
Upgrade path tool
The upgrade path tool is very helpful for showing which upgrade stops are required to get from the current version to the desired end version. It also shows other useful information, such as deprecations between releases.
It is hoped that by identifying current pain points, we can categorize and prioritize changes to the product that remove, ameliorate, or minimize them so customers will upgrade more frequently.
General observations
The following observations apply to all installation methods.
Backups needed
Backups are needed prior to doing upgrades. Performing backups thus becomes part of the upgrade process and extends the overall amount of time an upgrade takes to complete. Probably the only way to remove this requirement is to make upgrades so reliable that backups are not necessary.
Other than making backups more convenient, we can't really do much about this problem, as customers are almost always going to want to back up their systems before an upgrade (which we also encourage).
Waiting for migrations to complete
Administrators must wait until migrations are complete when upgrading over a migration stop. This process can take up to a week for large installations, which means upgrading over several upgrade stops can require a lot of calendar time and several sets of backup/maintenance periods.
We can't really do much about the time migrations take to complete, as migration completion is a requirement of upgrades. We might reduce the overall calendar time required for subsequent upgrades by notifying the administrator(s) when migrations complete after an upgrade.
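For Linux package installations, one way to check whether migrations have finished before moving on to the next upgrade stop is sketched below; the exact checks vary by version, so treat this as an illustration rather than the definitive procedure:
sudo gitlab-rake db:migrate:status   # every regular migration should report "up"
sudo gitlab-rails runner -e production 'puts Gitlab::BackgroundMigration.remaining'   # legacy background migrations; should print 0
# Batched background migrations can also be reviewed in the Admin area under Monitoring > Background migrations.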
The upgrade path tool is very helpful. However, it does not support Helm or Operator upgrades. It can still be used as a reference to show, for example, deprecations between releases.
When upgrading across many versions, it can be difficult to determine which deprecations apply. The upgrade path tool is very valuable for reporting which deprecations have occurred, but it does not filter out the deprecations that do not apply to a given installation. For example, upgrading from 16.8.10 to 17.4.1 shows 89 deprecations. It would be very helpful to have a tool that examines an installation and reports which deprecations apply for a given upgrade.
How to determine if an installation is ready for an upgrade
The installation needs to be in a known-good state before an upgrade is started. For example, all background migrations should be finished. This information is scattered across the current documentation, and can be incomplete or confusing. It would be nice to have a combined tool that would show whether the system is ready for an upgrade, for example by reporting whether all migrations have completed and which deprecations apply to the installation.
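Until such a tool exists, administrators typically assemble the checks by hand. A partial sketch for a Linux package installation (not an exhaustive readiness check):
sudo gitlab-rake gitlab:check   # general health check of the installation
sudo gitlab-rake gitlab:doctor:secrets   # verify stored secrets can be decrypted with the current secrets file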
Not all components can be upgraded with zero downtime
In GET installations, not all nodes are configured so that they can be upgraded with zero downtime. This may be true even when not using GET, as not all components may be configured for zero downtime.
PostgreSQL upgrades require downtime
PostgreSQL upgrades to the next major version require downtime.
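On Linux package installations, the major-version PostgreSQL upgrade is usually driven by the built-in helper, which stops and restarts the database and therefore implies a maintenance window. A minimal sketch:
sudo gitlab-ctl pg-upgrade   # upgrade the bundled PostgreSQL to the new major version shipped with the current package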
There is an Ansible playbook available to do zero-downtime upgrades for installations configured with GET. Note that this playbook only covers components configured with GET. Other components, e.g. praefect-postgres, will either have to use some other method or the playbook configuration will need to be modified to cover these missed components.
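The invocation follows the usual GET pattern of running a playbook against an environment's inventory; the playbook name below is a placeholder rather than the real file name, so check the GET repository for the actual zero-downtime update playbook:
ansible-playbook -i environments/<env>/inventory playbooks/<zero-downtime-update-playbook>.yml   # placeholder playbook name for illustration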
Multi-node HA upgrades with downtime
No additional pain points were discovered when testing multi-node HA upgrades with downtime.
Helm chart upgrades
This section describes pain points discovered during testing upgrades of helm chart installations.
No zero-downtime option
There is no zero-downtime upgrade option for Helm installations.
No chart upgrade documentation for hybrid installations
The existing chart upgrade documentation does not discuss (or only indirectly discusses) how to upgrade hybrid installations. Explicit documentation for hybrid installations should be included, given that this is the only configuration we support in production.
Bundled PostgreSQL upgrades require special handling
Upgrading the bundled PostgreSQL chart to the next PostgreSQL major version may require special handling. This may not be an issue given that we tell customers not to use the bundled PostgreSQL in production.
Operator upgrades
This section describes pain points discovered during testing upgrades of Operator installations.
No zero-downtime option
There is no zero-downtime upgrade option for Operator installations.
No operator upgrade documentation for hybrid installations
The existing Operator upgrade documentation does not discuss (or only indirectly discusses) how to upgrade hybrid installations. Explicit documentation for hybrid installations should be included, given that this is the only configuration we support in production.