Prevent "Pending restart" on the patroni postgresql cluster at the secondary GEO site

Dear GET team,

Recently we have successfully constructed a GEO setup with 3K architecture on primary and secondary site.

After deploying GEO sites, we have found out that patroni cluster on primary site is healthy and operational but patroni cluster at secondary site could not restart and stay on "pending restart". And

secondary-site:postgresql-1# gitlab-ctl patroni members
+ Cluster: postgresql-ha ----+----------------+---------+----+-----------+-----------------+
| Member         | Host      | Role           | State   | TL | Lag in MB | Pending restart |
+----------------+-----------+----------------+---------+----+-----------+-----------------+
| postgres-1     | 10.7.0.21 | Standby Leader | running | 40 |         0 | *               |
| postgres-2     | 10.7.0.22 | Replica        | running | 40 |           | *               |
| postgres-3     | 10.7.0.23 | Replica        | running | 40 |         0 | *               |
+----------------+-----------+----------------+---------+----+-----------+-----------------+

The reason for this "pending restart" is inconsistent parameter for replication slots between primary and secondary site

In the primary cluster, we have got following parameter

  • "max_wal_senders": 8
  • "max_replication_slots": 8

through the geo file path ansible/roles/gitlab_geo/templates/primary-database.rb.j2

  • patroni['postgresql']['max_replication_slots'] = {{ ((geo_primary_postgres_count | int) + (geo_secondary_site_count | int)) * 2 }}
  • patroni['postgresql']['max_wal_senders'] = {{ ((geo_primary_postgres_count | int) + (geo_secondary_site_count | int)) * 2 }}

In the standby cluster, we have the another value for same parameter:

  • "max_wal_senders": 7
  • "max_replication_slots": 6

through the standard postgresql file path ansible/roles/postgres/templates/postgres.gitlab.rb.j2

  • patroni['postgresql']['max_replication_slots'] = {{ ((groups['postgres'] | length) * 2) }}
  • patroni['postgresql']['max_wal_senders'] = {{ ((groups['postgres'] | length) * 2) + 1 }}

If we set same/consistent value of replication slots for both sites,

  • "max_wal_senders": 8
  • "max_replication_slots": 8

then "pending restart" issue can be resolved directly.

+ Cluster: postgresql-ha ----+----------------+---------+----+-----------+
| Member         | Host      | Role           | State   | TL | Lag in MB |
+----------------+-----------+----------------+---------+----+-----------+
| postgres-1     | 10.7.0.21 | Standby Leader | running | 40 |         0 |
| postgres-2     | 10.7.0.22 | Replica        | running | 40 |           |
| postgres-3     | 10.7.0.23 | Replica        | running | 40 |         0 |
+----------------+-----------+----------------+---------+----+-----------+

To avoid such an inconsistency issue at the patroni cluster setup, either parameters must be customized on the secondary site

  • patroni['postgresql']['max_replication_slots'] = 8
  • patroni['postgresql']['max_wal_senders'] = 8

or one can set up the correct values ​at the secodnary site as well, when constructing a GEO secondary database, as similar as it has been done at the primary site.

/etc/gitlab/gitlab.geo.secondary-database.rb

  • patroni['postgresql']['max_replication_slots'] = {{ ((geo_primary_postgres_count | int) + (geo_secondary_site_count | int)) * 2 }}
  • patroni['postgresql']['max_wal_senders'] = {{ ((geo_primary_postgres_count | int) + (geo_secondary_site_count | int)) * 2 }}

I believe this logical approach will be far more rational than hard-coding the target value. Should it be feasible near in future?

Best regards, Donghee

Edited by Donghee Kang