pg_upgrade fails on single-node Geo secondary (GitLab 15.11 / PostgreSQL 12 -> 13)
Summary
Customer tried to follow the documented steps to perform the major upgrade from PostgreSQL 12 to PostgreSQL 13. `pg_upgrade` fails: it upgrades the data for `geo-postgresql` and then fails to start `geo-postgresql`, because the PostgreSQL 12 binaries are still linked.

Preceded by GitLab Geo 15.11 Database Upgrade failing (#7822 - closed), which has a different workaround.
Steps to reproduce
- Build a Geo environment with a single Omnibus node as the Geo secondary.
- Follow the documented PostgreSQL upgrade steps.
(We've not yet worked out what other conditions are required to reproduce this.)
What is the current bug behavior?
`pg_upgrade` upgrades `geo-postgresql` from PostgreSQL 12 to PostgreSQL 13, and then fails with:

```
/opt/gitlab/embedded/service/omnibus-ctl/lib/postgresql.rb:26:in `rescue in wait_for_postgresql': Timed out waiting for PostgreSQL to start (Timeout::Error)
```

`gitlab-ctl tail geo-postgresql` shows repeated:

```
unrecognized configuration parameter "wal_keep_size" in file "/var/opt/gitlab/geo-postgresql/data/runtime.conf" line 30
```

`wal_keep_size` only exists in PostgreSQL 13 and later (it replaced `wal_keep_segments`), so the still-linked PostgreSQL 12 binaries reject it. Manually removing this parameter from the configuration results in the PostgreSQL binaries reading `/var/opt/gitlab/geo-postgresql/data/PG_VERSION` and then erroring about the major version mismatch instead.
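A quick way to confirm the mismatch described above (a sketch; paths are the Omnibus defaults):

```shell
# Major version reported by the currently linked binaries:
sudo /opt/gitlab/embedded/bin/postgres --version
# Major version the geo-postgresql data directory has already been upgraded to:
sudo cat /var/opt/gitlab/geo-postgresql/data/PG_VERSION
# The PostgreSQL 13-only parameter that the still-linked 12 binaries reject:
sudo grep -n wal_keep_size /var/opt/gitlab/geo-postgresql/data/runtime.conf
```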
What is the expected correct behavior?
`pg_upgrade` should not be attempting to start `geo-postgresql` at this point. It needs to first upgrade `postgresql`, or relink the binaries, or both - but it does neither.
Workaround
A lot of the steps are similar to the process when a Patroni replica fails to upgrade.
The backout process is documented in a comment below.
- check: The procedure for upgrading PostgreSQL when using Geo must have been followed:
  - The slot name has been identified.
  - You have the replication user's password.
  - You've attempted `sudo gitlab-ctl pg-upgrade` and it failed with `` `rescue in wait_for_postgresql': Timed out waiting for PostgreSQL to start (Timeout::Error) ``
- check: State of `geo-postgresql`; it should be erroring with `unrecognized configuration parameter`:

  ```shell
  sudo gitlab-ctl tail geo-postgresql
  ```
- check: Verify that the primary database is upgraded:

  ```shell
  sudo egrep '^[0-9]' /var/opt/gitlab/postgresql/data/PG_VERSION
  ```

  The result confirms it's upgraded, as follows. No other steps are required on the primary.

  ```
  /var/opt/gitlab/postgresql/data/PG_VERSION:13
  ```
  Verify that the two databases on the secondary are different releases:

  ```shell
  sudo egrep '^[0-9]' /var/opt/gitlab/geo-postgresql/data/PG_VERSION /var/opt/gitlab/postgresql/data/PG_VERSION
  ```

  The result must be that `geo-postgresql` has upgraded but `postgresql` has not:

  ```
  /var/opt/gitlab/geo-postgresql/data/PG_VERSION:13
  /var/opt/gitlab/postgresql/data/PG_VERSION:12
  ```

  The process only fixes the `postgresql` service, by initializing it from scratch and replicating it from the primary. This can't be done with `geo-postgresql`, and this process is not suitable for that; `geo-postgresql` must already show as upgraded. If you run `sudo gitlab-ctl pg-upgrade` a second time, it rolls back `geo-postgresql` to PostgreSQL 12; this is one way that `geo-postgresql` could end up in the wrong state for this workaround.
- check: Verify what binaries are live:

  ```shell
  sudo ls -al /opt/gitlab/embedded/bin | grep postgres
  ```

  Check the symbolic link paths: the majority should be `/opt/gitlab/embedded/postgresql/12`. There may be both later and earlier versions for one or two binaries, but the majority should be one version.
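  If it's not obvious from the listing, here's a rough way to tally the symlink targets per version (a sketch, not part of the documented procedure):

  ```shell
  # Count how many of the postgres-related symlinks point at each embedded PostgreSQL version.
  sudo ls -l /opt/gitlab/embedded/bin | grep postgres \
    | grep -o 'postgresql/[0-9]*' | sort | uniq -c
  ```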
- The workaround commences here. Please ensure the system matches the previous checks/prerequisites before proceeding. If you ran `sudo gitlab-ctl pg-upgrade` more than once, it's likely that it reverted the system to PostgreSQL 12:

  ```
  == Reverted to 12.12. Please check output for what went wrong ==
  ```

  Re-run `pg-upgrade` to upgrade the `geo-logcursor` database, and then run through the previous steps to confirm the system is now in the correct state. The workaround does not upgrade the `geo-logcursor` database.
- Stop all services. `gitlab-ctl reconfigure` takes different code paths when the PostgreSQL services are running.

  ```shell
  sudo gitlab-ctl stop
  ```
- Add `postgresql['version'] = 13` to `gitlab.rb`.
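  For example (a sketch; editing `/etc/gitlab/gitlab.rb` by hand achieves the same thing):

  ```shell
  # Pin the embedded PostgreSQL version so the next reconfigure relinks the binaries to 13.
  # This line is removed again in a later step of this workaround.
  echo "postgresql['version'] = 13" | sudo tee -a /etc/gitlab/gitlab.rb
  ```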
- Relink the binaries to version 13:

  ```shell
  sudo gitlab-ctl reconfigure
  ```
- Verify what binaries are live - it should now be the PostgreSQL 13 path:

  ```shell
  sudo ls -al /opt/gitlab/embedded/bin | grep postgres
  ```
- Rename the `postgresql` data directory. The name `data_12` is not used, to avoid interfering with normal Omnibus upgrade/downgrade procedures.

  ```shell
  sudo mv /var/opt/gitlab/postgresql/data /var/opt/gitlab/postgresql/data_12_issue7841
  ```
- Recreate the `data` directory. This will also start up `postgresql`; it should be the only thing running.

  ```shell
  sudo gitlab-ctl reconfigure
  sudo gitlab-ctl status
  ```
- Re-initialize replication:

  ```shell
  sudo gitlab-ctl replicate-geo-database --slot-name=SECONDARY_SLOT_NAME \
    --host=PRIMARY_HOST_NAME --sslmode=verify-ca
  ```

  - This completes step 5 in the Geo upgrade process. It may take some time; it does a full initial replication of the whole database.
  - `--sslmode=verify-ca` is essential. It's not in the upgrade documentation, but it is in the new-build documentation, step 3.
    - Without this parameter, it defaults to `--sslmode=verify-full` (PostgreSQL docs) and requires that the TLS certificate served by the primary contains the DNS name specified in `PRIMARY_HOST_NAME`.
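  An illustrative invocation with hypothetical values in place of the placeholders (use the slot name and primary host identified in the prerequisite checks):

  ```shell
  # Example values only; substitute your own slot name and primary host name.
  sudo gitlab-ctl replicate-geo-database \
    --slot-name=geo_secondary_example_com \
    --host=gitlab-primary.example.com \
    --sslmode=verify-ca
  ```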
- All services should be up now, as the previous step starts them:

  ```shell
  sudo gitlab-ctl status
  ```
- Check the databases are running:

  ```shell
  sudo gitlab-ctl tail geo-postgresql
  sudo gitlab-ctl tail postgresql
  ```

  - Errors in the main `postgresql` log about TLS validation and the DNS record not matching indicate that replication was not set up with `--sslmode=verify-ca`. This process has to be completed again from the start, but this time no backup of the `data` directory is required; just remove it.
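    A minimal sketch of that re-do, assuming the checks above confirmed the `--sslmode` mistake and the freshly-replicated `data` directory can simply be discarded:

    ```shell
    sudo gitlab-ctl stop                          # stop services again before touching the data directory
    sudo rm -rf /var/opt/gitlab/postgresql/data   # discard the bad copy; no backup needed, per the note above
    sudo gitlab-ctl reconfigure                   # recreate the data directory and start postgresql
    sudo gitlab-ctl replicate-geo-database --slot-name=SECONDARY_SLOT_NAME \
      --host=PRIMARY_HOST_NAME --sslmode=verify-ca   # re-run replication, this time with verify-ca
    ```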
- Remove the hard-coded PostgreSQL version, `postgresql['version'] = 13`, from `gitlab.rb`.
- Ensure that change is applied and also fix the PostgreSQL configuration. This completes step 6 in the Geo upgrade process.

  ```shell
  sudo gitlab-ctl reconfigure
  ```
- Verify what binaries are live - it should still be the PostgreSQL 13 path:

  ```shell
  sudo ls -al /opt/gitlab/embedded/bin | grep postgres
  ```
- Restart all processes. This completes step 7 in the Geo upgrade process. It differs in that here all processes are restarted, to ensure all configuration changes are live.

  ```shell
  sudo gitlab-ctl restart
  ```

  Wait for all processes to start:

  - Run `top -c -o RES`.
  - The `sidekiq`, `puma` and `ee/bin/geo_log_cursor` processes will consume 900 MB-1.1 GB once fully initialised.
  - The `top` command will show their memory use increasing in the `RES` column; once this stops, that process is fully started.
  - Additionally, once Puma has initialised and is at peak memory use, it forks multiple Puma workers.
- Check Geo:

  ```shell
  sudo gitlab-rake geo:status
  ```

  - If there's an error towards the top of the report about the database not replicating, check the main database:

    ```shell
    sudo gitlab-ctl tail postgresql
    ```
- Once the upgrade is verified, check in the following locations for redundant `data` directories. The `data` directory contains the live database - do not touch that. Any backups like `data_12`, `data_12_issue7841` or `data.<number>` can be removed.

  ```
  /var/opt/gitlab/geo-postgresql
  /var/opt/gitlab/postgresql
  ```
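  A sketch of how to see what's there before removing anything (the live directory is named exactly `data`; leave that alone):

  ```shell
  # List the contents of both locations; anything like data_12, data_12_issue7841
  # or data.<number> is a leftover backup, while "data" itself is the live database.
  sudo ls -al /var/opt/gitlab/geo-postgresql /var/opt/gitlab/postgresql
  ```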
Relevant logs
GitLab team members with Zendesk access can read more in the following customer ticket:
Details of package version
Ticket [1] 15.11.5
Configuration details
Ticket [1] configuration
Per the documentation, the secondary is configured with `roles(['geo_secondary_role'])`.