Geo: gitlab-ctl pg-upgrade failed to upgrade geo-postgresql and left the main database on 14 while geo-postgresql is on 13
Summary
When gitlab-ctl pg-upgrade is executed on a secondary site, it performs the upgrade in two steps:
- It tries to upgrade the main database (also taking Patroni into consideration).
- If that upgrade succeeds, it proceeds to upgrade geo-postgresql.
If step 1 fails, it performs a rollback and everything goes back to the previous version (13).
If step 2 fails, we end up in an inconsistent state; see the Relevant logs section.
When checking whether the main database was upgraded:
root@gabriel-geo:~# gitlab-psql
psql (13.12, server 14.9)
WARNING: psql major version 13, server major version 14.
Some psql features might not work.
This shows that the server is running 14.9 while the psql binaries are 13.12.
At this point, the symlinks point to 13.12 but the server is running 14.9.
If gitlab-ctl reconfigure is run, it updates the symlinks to point to 14.9, as that is what the main database is running:
root@gabriel-geo:~# gitlab-psql
psql (14.9)
Type "help" for help.
Checking the geo-postgresql data, we get the following:
root@gabriel-geo:/var/opt/gitlab/geo-postgresql# cat VERSION
postgres (PostgreSQL) 14.9
root@gabriel-geo:/var/opt/gitlab/geo-postgresql# cat data/PG_VERSION
13
This puts the system in an inconsistent state.
In addition, it seems the pg-upgrade script only checks the main database version, when it should check the main database and geo-postgresql as separate steps.
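As a rough illustration of checking both databases separately, the sketch below reads data/PG_VERSION from each data directory and reports whether they agree. It is not the pg-upgrade script's code. The function names are hypothetical, and the main data directory path is inferred from the paths shown in this report, so treat both as assumptions.

```python
from pathlib import Path
from typing import Dict, Optional

# Data directories as they appear in this report. The main one is inferred
# from /var/opt/gitlab/postgresql/data.13 in the logs (an assumption).
DATA_DIRS = {
    "postgresql": Path("/var/opt/gitlab/postgresql/data"),
    "geo-postgresql": Path("/var/opt/gitlab/geo-postgresql/data"),
}


def read_data_version(data_dir: Path) -> Optional[str]:
    """Return the major version recorded in data_dir/PG_VERSION, or None if absent."""
    version_file = data_dir / "PG_VERSION"
    if not version_file.is_file():
        return None
    return version_file.read_text().strip()


def data_versions(data_dirs: Dict[str, Path] = DATA_DIRS) -> Dict[str, Optional[str]]:
    """Check each database on its own instead of only the main one."""
    return {name: read_data_version(path) for name, path in data_dirs.items()}


def is_consistent(versions: Dict[str, Optional[str]]) -> bool:
    """True when every data directory that exists reports the same major version."""
    present = {v for v in versions.values() if v is not None}
    return len(present) <= 1


if __name__ == "__main__":
    versions = data_versions()
    print(versions, "consistent" if is_consistent(versions) else "MISMATCH")
```

On the node from this report, such a check would see 14 for the main database and 13 for geo-postgresql and flag the mismatch.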
Because they both use the same symlinked binaries, we have very few options for keeping both running when they are on different versions.
Perhaps we should always run the databases from the realpath of the binaries (not the symlinked path), while keeping the symlinks pointing to the latest installed version as a convenience to the system administrator.
Without the realpath, a failed upgrade like this one becomes an emergency situation where the tracking database can't start:
root@gabriel-geo:/var/opt/gitlab/geo-postgresql# gitlab-ctl tail geo-postgresql
==> /var/log/gitlab/geo-postgresql/current <==
2024-05-15_12:55:05.99872 FATAL: database files are incompatible with server
2024-05-15_12:55:05.99874 DETAIL: The data directory was initialized by PostgreSQL version 13, which is not compatible with this version 14.9.
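To make the realpath suggestion above more concrete, here is a minimal sketch of the idea. It is not how Omnibus is implemented. The PinnedService class and its fields are hypothetical. The point is only that a path resolved once, when the service is configured, keeps pointing at the same versioned binary even if the convenience symlink is later moved to a newer version.

```python
import os


def resolve_binary(symlinked_path: str) -> str:
    """Resolve a symlinked binary to the real file it currently points at."""
    return os.path.realpath(symlinked_path)


class PinnedService:
    """Hypothetical record of the real binary a database service was configured with."""

    def __init__(self, name: str, symlinked_binary: str) -> None:
        self.name = name
        # Resolved once, at configure time. Later changes to the symlink
        # do not affect this service until it is explicitly reconfigured.
        self.binary = resolve_binary(symlinked_binary)

    def repin(self, symlinked_binary: str) -> None:
        """Re-resolve the binary, for example after this service's own upgrade succeeds."""
        self.binary = resolve_binary(symlinked_binary)
```

With something like this, geo-postgresql could keep running the 13 binaries while the main database and the shared symlinks already point to 14, and would only be re-pinned once its own data directory had been upgraded.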
Steps to reproduce
?
What is the current bug behavior?
A failure to upgrade the Geo tracking database did not revert the system to a good state.
In addition, the upgrade script only checks the main database, which makes it impossible to just retry the tracking one.
In addition, the upgrade script checks pg_ctl for the version instead of looking at data/PG_VERSION, which is the SSOT for the version the database data is set to.
What is the expected correct behavior?
We have a couple of alternatives:
- When geo-postgresql fails to upgrade, we should also revert the main postgresql database.
This has the drawback that the main database is usually much bigger than the tracking one, so retrying an upgrade can be very frustrating: you have to redo the main upgrade, which is known to work, every time until you figure out what is wrong with the tracking one.
- Update the upgrade script to check data/PG_VERSION instead, to see which database has been upgraded, and check them both (so a retry on geo-postgresql becomes possible).
- We should use the realpath binaries to run the database instead of relying on the symlinked ones. That would allow the intermediate state where the main database is upgraded and the tracking one isn't but is rolled back to a previous working state. (There is no major problem running them on different versions for a short while.)
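The second alternative could be sketched roughly as follows: given the major version found in each database's data/PG_VERSION, decide which databases still need upgrading, so that a retry skips the main database when only geo-postgresql is behind. The function name and the dictionary layout are hypothetical, and the sketch assumes PG_VERSION holds a single integer major version (as on PostgreSQL 10 and later).

```python
from typing import Dict, List, Optional

# Order described in the summary: main database first, then geo-postgresql.
UPGRADE_ORDER = ["postgresql", "geo-postgresql"]


def databases_to_upgrade(
    data_versions: Dict[str, Optional[str]], target_major: str
) -> List[str]:
    """Return the databases whose data directory is still below target_major.

    data_versions maps a database name to the contents of its data/PG_VERSION,
    or None when that data directory does not exist on this node.
    """
    pending = []
    for name in UPGRADE_ORDER:
        current = data_versions.get(name)
        if current is None:
            continue  # database not present here, nothing to upgrade
        if int(current) < int(target_major):
            pending.append(name)
    return pending
```

For the state described in this issue (main on 14, tracking on 13, target 14), the result would contain only geo-postgresql, which is exactly the retry that is not possible today.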
Relevant logs
...
Running reconfigure: OK
Restarting Patroni on this node :
+ Cluster: postgresql-ha (7369195536400823741) -----------+----------------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------------------------------------+---------------+----------------+---------+----+-----------+
| gabriel-geo....internal | 10... | Standby Leader | running | 3 | |
+-----------------------------------------+---------------+----------------+---------+----+-----------+
Failed: restart for member gabriel-geo.....internal, status code=403, (Access is denied)
Restarting Patroni on this node : OK
Waiting for Database to be running.
==== Upgrade has completed ====
Please verify everything is working and run the following if so
sudo rm -rf /var/opt/gitlab/postgresql/data.13
sudo rm -f /var/opt/gitlab/postgresql-version.old
Upgrading the geo-postgresql database
Toggling deploy page:cp /opt/gitlab/embedded/service/gitlab-rails/public/deploy.html /opt/gitlab/embedded/service/gitlab-rails/public/index.html
Toggling deploy page: OK
Toggling services:ok: down: alertmanager: 0s, normally up
ok: down: crond: 0s, normally up
ok: down: geo-logcursor: 0s, normally up
ok: down: gitaly: 1s, normally up
ok: down: gitlab-exporter: 1s, normally up
ok: down: gitlab-kas: 0s, normally up
ok: down: logrotate: 0s, normally up
ok: down: node-exporter: 1s, normally up
ok: down: postgres-exporter: 0s, normally up
ok: down: prometheus: 1s, normally up
ok: down: redis-exporter: 0s, normally up
ok: down: registry: 0s, normally up
ok: down: sidekiq: 0s, normally up
Toggling services: OK
There was an error fetching locale and encoding information from the database
Please ensure the database is running and functional before running pg-upgrade
STDOUT:
STDERR: psql: error: connection to server on socket "/var/opt/gitlab/geo-postgresql/.s.PGSQL.5431" failed: No such file or directory
Is the server running locally and accepting connections on that socket?
== Fatal error ==
Please check error logs
== Reverting ==
ok: down: geo-postgresql: 1s, normally up, want up
Symlink correct version of binaries: OK
ok: run: geo-postgresql: (pid 9723) 0s
== Reverted ==
== Reverted to 13.12. Please check output for what went wrong ==
Details of package version
ii gitlab-ee 16.8.1-ee.0 amd64 GitLab Enterprise Edition (including NGINX, Postgres, Redis)