pg_upgrade fails on single-node Geo secondary (GitLab 15.11 / PostgreSQL 12 -> 13)
Summary
The customer followed the documented steps to perform the major upgrade from PostgreSQL 12 to PostgreSQL 13. `pg_upgrade` upgrades the data for `geo-postgresql`, but then fails to start `geo-postgresql` because the PostgreSQL 12 binaries are still linked.
Predated by GitLab Geo 15.11 Database Upgrade failing (#7822 - closed), which has a different workaround.
Steps to reproduce
- Build a Geo environment with a single Omnibus node as the Geo secondary.
- Follow the documented PostgreSQL upgrade steps.
(We've not yet worked out what other conditions are required to reproduce this)
What is the current bug behavior?
`pg_upgrade` upgrades `geo-postgresql` from PostgreSQL 12 to PostgreSQL 13, and then fails with:

```
/opt/gitlab/embedded/service/omnibus-ctl/lib/postgresql.rb:26:in
`rescue in wait_for_postgresql': Timed out waiting for PostgreSQL to start (Timeout::Error)
```

`gitlab-ctl tail geo-postgresql` shows repeated:

```
unrecognized configuration parameter "wal_keep_size" in
file "/var/opt/gitlab/geo-postgresql/data/runtime.conf" line 30
```
Manually removing `wal_keep_size` from the configuration lets the PostgreSQL binaries read `/var/opt/gitlab/geo-postgresql/data/PG_VERSION`, and they then error about the major version mismatch instead. (`wal_keep_size` only exists from PostgreSQL 13 onwards, so the still-linked PostgreSQL 12 binaries reject the generated configuration.)
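A quick way to confirm the mismatch by hand is to compare the generated configuration, the data directory version, and the linked binary version (a minimal sketch using the paths from the errors above; exact output will differ per system):

```shell
# The generated config contains a PostgreSQL 13-only parameter
sudo grep -n 'wal_keep_size' /var/opt/gitlab/geo-postgresql/data/runtime.conf

# The data directory has already been upgraded to 13
sudo cat /var/opt/gitlab/geo-postgresql/data/PG_VERSION

# ...but the linked binaries still report major version 12
sudo /opt/gitlab/embedded/bin/postgres --version
```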
What is the expected correct behavior?
`pg_upgrade` should not be attempting to start `geo-postgresql` at this point.
It needs to first upgrade `postgresql`, or relink the binaries, or both, but it does neither.
Workaround
A lot of the steps are similar to the process when a Patroni replica fails to upgrade.
The backout process is documented in a comment below.
- Check: the procedure for upgrading PostgreSQL when using Geo has been followed:
  - The slot name has been identified.
  - You have the replication user's password.
  - You've attempted `sudo gitlab-ctl pg-upgrade` and it failed with `rescue in wait_for_postgresql': Timed out waiting for PostgreSQL to start (Timeout::Error)`.
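  If the slot name is not already to hand, one way to list the replication slots is to query the primary (a sketch, run on the Geo primary node; `pg_replication_slots` is the standard PostgreSQL catalog view):

  ```shell
  # List replication slots on the Geo primary
  sudo gitlab-psql -c 'SELECT slot_name, active FROM pg_replication_slots;'
  ```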
- Check: the state of `geo-postgresql` with `sudo gitlab-ctl tail geo-postgresql`; it should be erroring with `unrecognized configuration parameter`.
- Check: verify that the primary database is upgraded: `sudo egrep '^[0-9]' /var/opt/gitlab/postgresql/data/PG_VERSION`. The result confirms it's upgraded, as follows; no other steps are required on the primary: `/var/opt/gitlab/postgresql/data/PG_VERSION:13`
  - Verify that the two databases on the secondary are different releases: `sudo egrep '^[0-9]' /var/opt/gitlab/geo-postgresql/data/PG_VERSION /var/opt/gitlab/postgresql/data/PG_VERSION`. The result must be that `geo-postgresql` has upgraded but `postgresql` has not: `/var/opt/gitlab/geo-postgresql/data/PG_VERSION:13` and `/var/opt/gitlab/postgresql/data/PG_VERSION:12`.
  - This process only fixes the `postgresql` service, by initializing it from scratch and replicating it from the primary. That can't be done with `geo-postgresql`, so this process is not suitable for it; it must already show as upgraded.
  - If you run `sudo gitlab-ctl pg-upgrade` a second time, it rolls back `geo-postgresql` to PostgreSQL 12; this is one way that `geo-postgresql` could end up in the wrong state for this workaround.
- Check: verify which binaries are live: `sudo ls -al /opt/gitlab/embedded/bin | grep postgres`. Check the symbolic link paths: the majority should point at `/opt/gitlab/embedded/postgresql/12`. There may be both later and earlier versions for one or two binaries, but the majority should be one version.
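  To check a single binary's link target directly, something like the following should work (a sketch; `postgres` is just one of the linked binaries):

  ```shell
  # Resolve the symlink to see which versioned directory it points at
  readlink -f /opt/gitlab/embedded/bin/postgres
  ```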
- The workaround commences here. Please ensure the system matches the previous checks/prerequisites before proceeding.
  - If you ran `sudo gitlab-ctl pg-upgrade` more than once, it's likely that it reverted the system to PostgreSQL 12: `== Reverted to 12.12. Please check output for what went wrong ==`. Re-run `pg-upgrade` to upgrade the `geo-logcursor` database, then run through the previous checks again to confirm the system is now in the correct state. The workaround does not upgrade the `geo-logcursor` database.
- Stop all services: `sudo gitlab-ctl stop`. (`gitlab-ctl reconfigure` takes different code paths when the PostgreSQL services are running.)
- Add `postgresql['version'] = 13` to `gitlab.rb`.
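  One way to apply this non-interactively is sketched below (editing `/etc/gitlab/gitlab.rb` by hand works just as well; the path assumes the default Omnibus location):

  ```shell
  # Append the version pin to gitlab.rb if it is not already present
  sudo grep -q "postgresql\['version'\]" /etc/gitlab/gitlab.rb || \
    echo "postgresql['version'] = 13" | sudo tee -a /etc/gitlab/gitlab.rb
  ```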
- Relink the binaries to version 13: `sudo gitlab-ctl reconfigure`
- Verify which binaries are live; it should now be the PostgreSQL 13 path: `sudo ls -al /opt/gitlab/embedded/bin | grep postgres`
- Rename the `postgresql` data directory. The name `data_12` is not used, to avoid interfering with normal Omnibus upgrade/downgrade procedures: `sudo mv /var/opt/gitlab/postgresql/data /var/opt/gitlab/postgresql/data_12_issue7841`
- Recreate the `data` directory. This will also start up `postgresql`; it should be the only thing running: `sudo gitlab-ctl reconfigure` then `sudo gitlab-ctl status`
- Re-initialize replication: `sudo gitlab-ctl replicate-geo-database --slot-name=SECONDARY_SLOT_NAME --host=PRIMARY_HOST_NAME --sslmode=verify-ca`
  - This completes step 5 in the Geo upgrade process. It may take some time; it does a full initial replication of the whole database.
  - `--sslmode=verify-ca` is essential. It's not in the upgrade documentation, but it is in the new-build documentation, step 3.
  - Without this parameter, it'll default to `--sslmode=verify-full` (see the PostgreSQL docs) and require that the TLS certificate served by the primary contains the DNS name specified in `PRIMARY_HOST_NAME`.
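  For illustration only, a filled-in invocation might look like the following (the slot name and host below are hypothetical placeholders; substitute the values identified in the earlier checks):

  ```shell
  # Example values only: use the real slot name and primary host name
  sudo gitlab-ctl replicate-geo-database \
    --slot-name=geo_secondary_example \
    --host=gitlab-primary.example.com \
    --sslmode=verify-ca
  ```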
- All services should be up now, as the previous step starts them: `sudo gitlab-ctl status`
- Check the databases are running: `sudo gitlab-ctl tail geo-postgresql` and `sudo gitlab-ctl tail postgresql`
  - Errors in the main `postgresql` log about TLS validation and the DNS record not matching indicate that replication was not set up with `--sslmode=verify-ca`. This process has to be completed again from the start, but this time no backup of the `data` directory is required; just remove it.
- Remove the hard-coded PostgreSQL version from `gitlab.rb`: delete the `postgresql['version'] = 13` line.
- Ensure that change is applied and also fix the PostgreSQL configuration. This completes step 6 in the Geo upgrade process: `sudo gitlab-ctl reconfigure`
- Verify which binaries are live; it should still be the PostgreSQL 13 path: `sudo ls -al /opt/gitlab/embedded/bin | grep postgres`
- Restart all processes. This completes step 7 in the Geo upgrade process. It differs in that here all processes are restarted to ensure all configuration changes are live: `sudo gitlab-ctl restart`
  Wait for all processes to start:
  - Run `top -c -o RES`
  - The `sidekiq`, `puma` and `ee/bin/geo_log_cursor` processes will consume 900 MB-1.1 GB once fully initialised.
  - The `top` command will show their memory use increasing in the `RES` column; once this stops, that process is fully started.
  - Additionally, once Puma has initialised and is at peak memory use, it forks multiple Puma workers.
- Check Geo: `sudo gitlab-rake geo:status`
  - Check the main database if there's an error about the database not replicating towards the top of the report: `sudo gitlab-ctl tail postgresql`
- Once the upgrade is verified, check in the following locations for redundant `data` directories: `/var/opt/gitlab/geo-postgresql` and `/var/opt/gitlab/postgresql`. The `data` directory contains the live database; do not touch that. Any backups like `data_12`, `data_12_issue7841` or `data.<number>` can be removed.
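As a final sanity check, the following sketch re-uses the commands from the earlier checks to confirm that both data directories and the linked binaries are now on PostgreSQL 13 (paths as used throughout this workaround):

```shell
# Both data directories should now report major version 13
sudo egrep '^[0-9]' /var/opt/gitlab/geo-postgresql/data/PG_VERSION /var/opt/gitlab/postgresql/data/PG_VERSION

# The linked binaries should point at the PostgreSQL 13 directory
sudo ls -al /opt/gitlab/embedded/bin | grep postgres
```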
Relevant logs
GitLab team members with Zendesk access can read more in the following customer ticket:
Details of package version
Ticket [1] 15.11.5
Configuration details
Ticket [1] configuration
Per the documentation, the secondary is configured with `roles(['geo_secondary_role'])`.