Investigate why upgrade from PostgreSQL 16.8 to 16.10 in Patroni cluster got stuck

A customer upgrading from GitLab 18.2.2 to 18.5.1 hit this problem:

The initial reconfigure after installing the 18.5.1 .deb package hung on the restart of postgresql and required a ctrl c to stop the reconfigure.

INFO: execute[signal to restart postgresql if running version is less than installed one] sending run action to ruby_block[wait for node bootstrap to complete] (before)
  * ruby_block[wait for node bootstrap to complete] action run[2025-11-04T19:33:12+00:00] INFO: ruby_block[wait for node bootstrap to complete] called

    - execute the ruby block wait for node bootstrap to complete
  * execute[signal to restart postgresql if running version is less than installed one] action run

As a result, the next gitlab-ctl reconfigure failed:

$ sudo gitlab-ctl reconfigure[2025-11-04T20:01:42+00:00] INFO: Running queued delayed notifications before re-raising exception
[2025-11-04T20:01:42+00:00] INFO: templatesymlink[Create a gitlab.yml and create a symlink to Rails root] sending run action to execute[clear the gitlab-rails cache] (delayed)
  Recipe: gitlab::gitlab-rails
    * execute[clear the gitlab-rails cache] action run (skipped due to not_if)
[2025-11-04T20:01:42+00:00] INFO: file[/var/opt/gitlab/postgresql/data/pg_ident.conf] sending run action to execute[reload postgresql] (delayed)
  Recipe: patroni::enable
   (skipped due to only_if)

  Running handlers:
[2025-11-04T20:01:42+00:00] ERROR: Running exception handlers
There was an error running gitlab-ctl reconfigure:

exit

At this time, the patroni replica we were attempting to upgrade was in a pending_restart state with a high Lag and logs were indicating FATAL: could not receive data from WAL stream:

So, we verified the postgres version was updated to 16.10 and tried to reinitialize the patroni replica. The gitlab-ctl reinitialize-replica command failed with a similar error to this - Error while reinitializing replica on the current node: Attributes not found in /opt/gitlab/embedded/nodes/<HOSTNAME>.json, has reconfigure been run yet?

Sure enough, the nodes file just contained the hostname. So, we followed the steps here to recover: https://docs.gitlab.com/administration/postgresql/replication_and_failover_troubleshooting/#errors-running-gitlab-ctl

The same thing happened on the second replica. I was monitoring the nodes file and the initial reconfigure after the upgrade is when the nodes file was updated causing these issues. We followed the steps to basically rebuild the replica.

Once that was done, we performed a gitlab-ctl patroni switchover on the leader and had to do the same steps there.

Edited by Stan Hu