Allow specifying a member name for the `patroni reinitialize-replica` command
When a node fails to boot because of a system ID mismatch, one way to fix it is to issue a `reinit` command to that node via `patroni`.
We currently wrap that command as `gitlab-ctl patroni reinitialize-replica`. The problem is that when there is a system ID mismatch, `patroni` does not bring up its API on the problematic node. That means the reinitialization has to be triggered from a healthy node instead.
Our `gitlab-ctl patroni` doesn't expose the member option, so it always uses the current machine's member name when delegating to `reinit`. We need to add a `--member` flag to allow overriding that.
Logs
Here is what it looks like when `patroni` has the system ID mismatch problem:
$ gitlab-ctl tail patroni
2021-01-14_18:31:07.87645 2021-01-14 18:31:07,876 INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-01-14_18:31:07.88011 2021-01-14 18:31:07,879 INFO: establishing a new patroni connection to the postgres cluster
2021-01-14_18:31:07.89219 2021-01-14 18:31:07,891 CRITICAL: system ID mismatch, node gabriel-patroni-primary-geo-patroni-2.c.group-geo-f9c951.internal belongs to a different cluster: 6915205339950380026 != 6915204984830566901
2021-01-14_18:31:07.90700 2021-01-14 18:31:07,906 WARNING: Could not register service: unknown role type promoted
2021-01-14_18:31:08.44187 2021-01-14 18:31:08,441 ERROR: PostgreSQL shutdown failed, leader key not removed.
And from a healthy `patroni` node:
$ gitlab-ctl patroni members
+ Cluster: postgresql-ha (6915205339950380026) ---------------------+-------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-------------------------------------------------------------------+-------------+---------+---------+----+-----------+
| gabriel-patroni-primary-geo-patroni-2.c.group-geo-f9c951.internal | 10.164.0.58 | Replica | running | 1 | 19 |
| gabriel-patroni-primary-geo-patroni-3.c.group-geo-f9c951.internal | 10.164.0.59 | Replica | running | 1 | 19 |
| gabriel-patroni-primary-geo.c.group-geo-f9c951.internal | 10.164.0.55 | Leader | running | 2 | |
+-------------------------------------------------------------------+-------------+---------+---------+----+-----------+
(Nodes with a smaller `TL` and any `Lag` are the problematic ones that can be fixed with the `reinitialize-replica` command.)
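The rule above can be sketched as a small filter over the `members` table. This is only an illustration: `find_stale_members` is a hypothetical helper name, and the script assumes the column order shown in the output above (Member, Host, Role, State, TL, Lag in MB).

```shell
# Sketch: print members whose timeline (TL) is behind the leader's or that report lag.
# Reads the table printed by `gitlab-ctl patroni members` on stdin.
# Assumption: columns are Member | Host | Role | State | TL | Lag in MB.
find_stale_members() {
  awk -F'|' '
    # Data rows start with "|"; border and cluster lines start with "+".
    /^\|/ {
      # Trim the padding around each cell.
      for (i = 2; i <= 7; i++) gsub(/^ +| +$/, "", $i)
      if ($2 == "Member") next          # skip the header row
      name[++n] = $2; role[n] = $4; tl[n] = $6 + 0; lag[n] = $7 + 0
      if ($4 == "Leader") leader_tl = $6 + 0
    }
    END {
      # The leader row may come last, so compare only after reading everything.
      for (i = 1; i <= n; i++)
        if (role[i] != "Leader" && (tl[i] < leader_tl || lag[i] > 0))
          print name[i]
    }
  '
}
```

It could be used as `gitlab-ctl patroni members | find_stale_members` on a healthy node; each printed name is a candidate for reinitialization.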
Workaround
The current workaround is to call `patronictl` directly: `/opt/gitlab/embedded/bin/patronictl -c /var/opt/gitlab/patroni/patroni.yaml reinit postgresql-ha`
It will ask which node you want to reinitialize. Optionally, you can specify the node as part of the command, for example: `/opt/gitlab/embedded/bin/patronictl -c /var/opt/gitlab/patroni/patroni.yaml reinit postgresql-ha gabriel-patroni-primary-geo-patroni-2.c.group-geo-f9c951.internal`
Proposal
Add a `--member` flag to specify a custom member name instead of using the current machine's.
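To make the intended behavior concrete, here is a rough shell illustration of the override. It is not the `gitlab-ctl` implementation: `reinit_command` is a hypothetical name, the cluster name is taken from the output above, and falling back to `hostname -f` is an assumption standing in for "the current machine's member name".

```shell
# Hypothetical illustration of the proposed --member behavior.
PATRONICTL=/opt/gitlab/embedded/bin/patronictl
PATRONI_CONFIG=/var/opt/gitlab/patroni/patroni.yaml
CLUSTER=postgresql-ha   # cluster name as shown in the members output

reinit_command() {
  member=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --member)
        [ "$#" -ge 2 ] || { echo "--member requires a value" >&2; return 1; }
        member="$2"
        shift 2
        ;;
      *)
        echo "unknown option: $1" >&2
        return 1
        ;;
    esac
  done
  # Assumption: without --member, the local member name equals the machine's FQDN.
  [ -n "$member" ] || member="$(hostname -f)"
  # Print the command instead of running it, so the sketch is safe to try anywhere.
  echo "$PATRONICTL -c $PATRONI_CONFIG reinit $CLUSTER $member"
}
```

With the proposed flag, running `gitlab-ctl patroni reinitialize-replica --member <name>` on a healthy node would then target the broken replica, where today it always targets the node it runs on.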