pg_upgrade fails on single-node Geo secondary (GitLab 15.11 / PostgreSQL 12 -> 13)

Summary

Customer tried to follow the documented steps to perform the major upgrade from PostgreSQL 12 to PostgreSQL 13. pg_upgrade upgraded the data for geo-postgresql, then failed to start geo-postgresql because the PostgreSQL 12 binaries are still linked.

Preceded by GitLab Geo 15.11 Database Upgrade failing (#7822 - closed), which has a different workaround.

Steps to reproduce

  • Build a Geo environment with a single Omnibus node as the Geo secondary.
  • Follow the documented PostgreSQL upgrade steps.

(We've not yet worked out what other conditions are required to reproduce this.)

What is the current bug behavior?

pg_upgrade upgrades geo-postgresql from PG12 to PG13, and then fails with:

/opt/gitlab/embedded/service/omnibus-ctl/lib/postgresql.rb:26:in
 `rescue in wait_for_postgresql': Timed out waiting for PostgreSQL to start (Timeout::Error)

gitlab-ctl tail geo-postgresql shows repeated:

 unrecognized configuration parameter "wal_keep_size" in 
file "/var/opt/gitlab/geo-postgresql/data/runtime.conf" line 30

Manually removing this parameter from the configuration results in the PostgreSQL 12 binaries reading /var/opt/gitlab/geo-postgresql/data/PG_VERSION and then erroring about the major version mismatch instead.
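
A quick way to confirm the mismatch (a sketch, assuming the default Omnibus paths) is to compare the data directory's version with the version of the currently linked binaries:

sudo cat /var/opt/gitlab/geo-postgresql/data/PG_VERSION
/opt/gitlab/embedded/bin/postgres --version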

What is the expected correct behavior?

pg_upgrade should not be attempting to start geo-postgresql at this point.

It needs to first upgrade postgresql or relink the binaries (or both), but it does neither.

Workaround

A lot of the steps are similar to the process when a Patroni replica fails to upgrade.

The backout process is documented in a comment below.

  1. check: The procedure for upgrading PostgreSQL when using Geo needs to have been followed:

    • The slot name has been identified.
    • You have the replication user's password.
    • You've attempted sudo gitlab-ctl pg-upgrade and it failed with rescue in wait_for_postgresql': Timed out waiting for PostgreSQL to start (Timeout::Error)
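
    If the replication slot name wasn't recorded anywhere, it can usually be read back from the primary (a sketch; run on the primary node, which has superuser access via gitlab-psql):

    sudo gitlab-psql -c 'SELECT slot_name FROM pg_replication_slots;'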
  2. check: state of geo-postgresql; it should be erroring with unrecognized configuration parameter

    sudo gitlab-ctl tail geo-postgresql
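
    To confirm the offending parameter, the generated runtime configuration can also be checked directly (a sketch, using the path from the log output above):

    sudo grep -n wal_keep_size /var/opt/gitlab/geo-postgresql/data/runtime.conf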
  3. check: Verify that the primary database is upgraded

    sudo egrep '^[0-9]'  /var/opt/gitlab/postgresql/data/PG_VERSION

    Result confirms it's upgraded, as follows. No other steps are required on the primary.

    /var/opt/gitlab/postgresql/data/PG_VERSION:13

    Verify that the two databases on the secondary are different releases:

    sudo egrep '^[0-9]' /var/opt/gitlab/geo-postgresql/data/PG_VERSION  /var/opt/gitlab/postgresql/data/PG_VERSION

    The result must be that geo-postgresql has upgraded but postgresql has not.

    /var/opt/gitlab/geo-postgresql/data/PG_VERSION:13
    /var/opt/gitlab/postgresql/data/PG_VERSION:12

    This workaround only fixes the postgresql service, by initializing it from scratch and replicating it from the primary. That approach can't be used for geo-postgresql, so geo-postgresql must already show as upgraded.

    If you run sudo gitlab-ctl pg-upgrade a second time, it rolls back geo-postgresql to PostgreSQL 12; this is one way that geo-postgresql would be in the wrong state for this workaround.

  4. check: Verify what binaries are live

    sudo ls -al /opt/gitlab/embedded/bin | grep postgres

    Check the symbolic link paths: the majority should point to /opt/gitlab/embedded/postgresql/12. One or two binaries may have both later and earlier versions, but the majority should be a single version.
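
    An optional, more precise check (a sketch, assuming the default Omnibus layout) is to resolve the main binary's symlink directly:

    readlink -f /opt/gitlab/embedded/bin/postgres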

  5. The workaround commences here. Please ensure the system matches the previous checks/prerequisites before proceeding. If you ran sudo gitlab-ctl pg-upgrade more than once, it's likely that it will have reverted the system to PostgreSQL 12:

    == Reverted to 12.12. Please check output for what went wrong == 

    Run pg-upgrade again to upgrade the geo-postgresql database, then run through the previous steps to confirm the system is now in the correct state. The workaround does not upgrade the geo-postgresql database.

  6. Stop all services. gitlab-ctl reconfigure takes different code paths when the PostgreSQL services are running.

    sudo gitlab-ctl stop
  7. Add postgresql['version'] = 13 to gitlab.rb
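
    For example (a sketch; adjust if /etc/gitlab/gitlab.rb is managed by configuration management):

    echo "postgresql['version'] = 13" | sudo tee -a /etc/gitlab/gitlab.rb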

  8. Relink the binaries to version 13.

    sudo gitlab-ctl reconfigure
  9. Verify what binaries are live - it should now be the PostgreSQL 13 path.

    sudo ls -al /opt/gitlab/embedded/bin | grep postgres
  10. Rename the postgresql data directory. The name data_12 is deliberately not used, to avoid interfering with normal Omnibus upgrade/downgrade procedures.

    sudo mv /var/opt/gitlab/postgresql/data /var/opt/gitlab/postgresql/data_12_issue7841
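
    Optionally, confirm that no live data directory remains before the next reconfigure (a sketch):

    sudo ls -d /var/opt/gitlab/postgresql/data*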
  11. Recreate the data directory. This will also start up postgresql. It should be the only thing running.

    sudo gitlab-ctl reconfigure
    sudo gitlab-ctl status
  12. Re-initialize replication

    sudo gitlab-ctl replicate-geo-database --slot-name=SECONDARY_SLOT_NAME \
        --host=PRIMARY_HOST_NAME --sslmode=verify-ca
    • This completes step 5 in the Geo upgrade process. It may take some time; it does a full initial replication of the whole database.
    • --sslmode=verify-ca is essential. It's not in the upgrade documentation, but it is in the new-build documentation, step 3.
    • Without this parameter, it defaults to --sslmode=verify-full (PostgreSQL docs) and requires that the TLS certificate served by the primary contains a DNS entry matching the host name specified in PRIMARY_HOST_NAME.
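
    Once it completes, streaming replication can be checked from the secondary (a sketch; assumes the default Omnibus superuser access via gitlab-psql):

    sudo gitlab-psql -c 'SELECT status, sender_host FROM pg_stat_wal_receiver;'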
  13. All services should be up now, as the previous step starts them:

    sudo gitlab-ctl status
  14. Check the databases are running:

    sudo gitlab-ctl tail geo-postgresql
    sudo gitlab-ctl tail postgresql
    • Errors in the main postgresql log about TLS validation and the DNS record not matching indicate that replication was not set up with --sslmode=verify-ca. This process has to be completed again from the start, but this time no backup of the data directory is required; just remove it.
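
    To locate such errors without watching the live tail, the log directory can be searched (a sketch, assuming the default Omnibus log path):

    sudo grep -ri certificate /var/log/gitlab/postgresql/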
  15. Remove the hard-coded PostgreSQL version (postgresql['version'] = 13) from gitlab.rb

  16. Ensure that change is applied and also fix the PostgreSQL configuration. This completes step 6 in the Geo upgrade process.

    sudo gitlab-ctl reconfigure
  17. Verify what binaries are live - it should still be the PostgreSQL 13 path.

    sudo ls -al /opt/gitlab/embedded/bin | grep postgres
  18. Restart all processes. This completes step 7 in the Geo upgrade process. It differs in that all processes are restarted here, to ensure all configuration changes are live.

    sudo gitlab-ctl restart

    Wait for all processes to start:

    • Run top -c -o RES
    • The sidekiq, puma and ee/bin/geo_log_cursor processes will consume 900 MB-1.1 GB once fully initialised.
    • The top command will show their memory use increasing in the RES column; once this stops, that process is fully started.
    • Additionally, once Puma has initialised and is at peak memory use, it forks multiple puma workers.
  19. Check Geo.

    sudo gitlab-rake geo:status
    • If there's an error towards the top of the report about the database not replicating, check the main database: sudo gitlab-ctl tail postgresql
  20. Once the upgrade is verified, check in the following locations for redundant data directories. The data directory contains the live database - do not touch that. Any backups like data_12, data_12_issue7841 or data.<number> can be removed.

    /var/opt/gitlab/geo-postgresql
    /var/opt/gitlab/postgresql
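
    A sketch for enumerating leftover directories before deciding what to remove:

    sudo ls -d /var/opt/gitlab/postgresql/data* /var/opt/gitlab/geo-postgresql/data*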

Relevant logs

GitLab team members with Zendesk access can read more in the following customer ticket:

Details of package version

Ticket [1] 15.11.5

Configuration details

Ticket [1] configuration

Per the documentation, the Geo secondary is configured with roles(['geo_secondary_role']).
