Zero-downtime upgrade should restart Patroni before Consul to allow a quicker failover
Summary
During zero-downtime upgrades, part of our Postgres main task is to restart all services on each Postgres node. This is done by running gitlab-ctl restart, which restarts all services in alphabetical order.
We found that if the consul service is restarted immediately before patroni, the failover procedure is disrupted, leaving the Patroni cluster leaderless for several seconds. This can lead to an availability outage, which is undesirable in a procedure that strives to keep downtime to a minimum.
Current Behavior
The leader is shut down at 2025-01-13_07:33:24:
2025-01-13_07:33:24.16447 LOG: received fast shutdown request
2025-01-13_07:33:27.23412 2025-01-13 07:33:27,232 WARNING: Postgresql is not running.
2025-01-13_07:33:27.23667 2025-01-13 07:33:27,233 INFO: Lock owner: None; I am node338.node.xxx
2025-01-13_07:33:27.25894 2025-01-13 07:33:27,258 INFO: starting as a secondary
2025-01-13_07:33:27.26231 2025-01-13 07:33:27,261 INFO: Lock owner: None; I am node338.node.xxx
2025-01-13_07:33:27.26399 2025-01-13 07:33:27,261 INFO: not healthy enough for leader race
2025-01-13_07:33:27.27219 2025-01-13 07:33:27,271 INFO: restarting after failure in progress
2025-01-13_07:33:27.39571 2025-01-13 07:33:27,395 INFO: postmaster pid=1392047
Only replicas are available while the leader node (node338) is restarted:
2025-01-13_07:33:28.43552 2025-01-13 07:33:28,435 INFO: establishing a new patroni connection to the postgres cluster
2025-01-13_07:33:28.46882 2025-01-13 07:33:28,468 INFO: Got response from node339.node.xxx http://xx.xxx.xx.151:8008/patroni: {"state": "running", "postmaster_start_time": "2025-01-03 10:58:28.321549+00:00", "role": "replica", "server_version": 140011, "cluster_unlocked": true, "xlog": {"received_location": 32031189298848, "replayed_location": 32031189298848, "replayed_timestamp": "2025-01-13 07:33:20.146857+00:00", "paused": false}, "timeline": 6, "database_system_identifier": "7357417423373317263", "pending_restart": true, "patroni": {"version": "2.1.0", "scope": "postgresql-ha"}}
2025-01-13_07:33:28.46932 2025-01-13 07:33:28,469 INFO: Got response from node340.node.xxx http://xx.xxx.xx.123:8008/patroni: {"state": "running", "postmaster_start_time": "2025-01-03 10:58:28.132146+00:00", "role": "replica", "server_version": 140011, "cluster_unlocked": true, "xlog": {"received_location": 32031189298680, "replayed_location": 32031189298680, "replayed_timestamp": "2025-01-13 07:33:20.124143+00:00", "paused": false}, "timeline": 6, "database_system_identifier": "7357417423373317263", "pending_restart": true, "patroni": {"version": "2.1.0", "scope": "postgresql-ha"}}
2025-01-13_07:33:28.61345 LOG: database system is ready to accept connections
2025-01-13_07:33:29.59980 2025-01-13 07:33:29,574 INFO: Lock owner: node338.node.xxx; I am node338.node.xxx
2025-01-13_07:33:29.59983 2025-01-13 07:33:29,599 INFO: Register service postgresql-ha, params {'service_id': 'postgresql-ha/node338.node.xxx', 'address': '10.162.250.194', 'port': 5432, 'check': {'http': 'http://xx.xxx.xx.194:8008/master', 'interval': '10s', 'DeregisterCriticalServiceAfter': '150.0s'}, 'tags': ['master']}
Expected Behavior
Our goal is to avoid outages and downtime and to reduce the number of 5xx errors as much as possible. The PostgresqlHA cluster should always remain online and should not cause errors.
If we restart Patroni first, a new leader is quickly elected, allowing consul and other services to be safely restarted afterward without triggering downtime or a "leaderless" cluster.
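Before restarting consul, the automation could confirm that a leader has been elected. A minimal sketch; the wait_for_leader helper is hypothetical, and the port-8008 /patroni endpoint and the "role" field are taken from the Patroni REST responses in the logs above:

```shell
# Sketch: poll a node's Patroni REST API until it reports the leader
# role, so consul is only restarted once the cluster has a leader.
# Requires curl; host name and retry count are illustrative.
wait_for_leader() {
  local host=$1 tries=${2:-30}
  for _ in $(seq "$tries"); do
    # The leader reports "role": "master" on the /patroni endpoint
    if curl -sf "http://${host}:8008/patroni" | grep -q '"role": "master"'; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

For example: sudo gitlab-ctl restart patroni && wait_for_leader node338.node.xxx && sudo gitlab-ctl restart consul.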
Tests
As proposed by Grant here, we've tested running the following commands in succession, which solve the problem:
sudo gitlab-ctl restart patroni
sudo gitlab-ctl restart consul node-exporter
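If the upgrade tooling restarts services programmatically rather than by hand, the same ordering can be encoded in a small helper. A sketch, assuming bash; restart_order is a hypothetical function, and the gitlab-ctl calls are shown commented out so the ordering logic stands alone:

```shell
# Sketch: emit patroni first, then the remaining services in the usual
# alphabetical order, instead of relying on gitlab-ctl restart's
# default ordering (which restarts consul before patroni).
restart_order() {
  local svc others=""
  for svc in "$@"; do
    [ "$svc" = "patroni" ] || others="$others$svc\n"
  done
  echo patroni
  printf "%b" "$others" | sort
}

# for svc in $(restart_order consul node-exporter patroni); do
#   sudo gitlab-ctl restart "$svc"
# done
```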
5xx errors seen during upgrade
Inspecting the Grafana dashboard for both cases shows a high peak of 5xx errors in test 1 and only a small number in test 2, as can be seen in the screenshot below.
- Test 1@11:37
- Test 2@12:05
