PostgreSQL replication didn't work with Patroni because the loopback address was required in trust_auth_cidr_addresses

Summary

Errors after restarting the Patroni nodes, broadly:

  • FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "gitlab_replicator"

I conclude this is the reason for the other error:

  • FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/5000000 is ahead of the WAL flush position of this server 0/41803F0

I don't think the cluster was actually replicating prior to the restart, as these errors were already occurring. Resolved by reinitializing the secondaries.
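
For context, PostgreSQL LSNs like 0/5000000 are hexadecimal high/low 32-bit halves, so the gap in that WAL error can be computed directly. A minimal sketch (lsn_to_int is my own illustrative helper, not part of any library):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN string 'X/Y' (hex halves) to a byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# The replica asked to start streaming at 0/5000000, but the server it
# reached had only flushed up to 0/41803F0:
gap = lsn_to_int("0/5000000") - lsn_to_int("0/41803F0")
print(f"{gap} bytes (~{gap / 1024**2:.1f} MB)")  # prints: 15203344 bytes (~14.5 MB)
```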

Logging this as an issue in case these errors arise outside of our testing.

Fix

In gitlab.rb, add 127.0.0.1/32 to postgresql['trust_auth_cidr_addresses']:

postgresql['trust_auth_cidr_addresses'] = ['172.18.0.0/24','127.0.0.1/32']
# or
postgresql['trust_auth_cidr_addresses'] = %w(172.18.0.0/24 127.0.0.1/32)

Then run gitlab-ctl reconfigure.
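
The underlying issue is simple address containment: Patroni's local replication check connects from 127.0.0.1, which is not inside 172.18.0.0/24, so no trust entry matched until the /32 was added. A quick check with Python's stdlib ipaddress module (illustrative only):

```python
import ipaddress

cidrs = ["172.18.0.0/24", "127.0.0.1/32"]  # the fixed setting
src = ipaddress.ip_address("127.0.0.1")    # Patroni's local replication connection

# Without the /32 entry, nothing matches the loopback source address.
assert src not in ipaddress.ip_network("172.18.0.0/24")
# With it, the connection is covered.
assert any(src in ipaddress.ip_network(c) for c in cidrs)
```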

Steps to reproduce

I have a Patroni cluster running in Docker using the Omnibus images.

This assembled fine, and then after restarting the three Patroni containers, I found that replication wasn't working.

for i in 1 2 3 ; do docker restart patroni$i ; done
# curl -vs  http://172.18.0.112:8008/cluster | jq .
* About to connect() to 172.18.0.112 port 8008 (#0)
*   Trying 172.18.0.112...
* Connected to 172.18.0.112 (172.18.0.112) port 8008 (#0)
> GET /cluster HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 172.18.0.112:8008
> Accept: */*
> 
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Server: BaseHTTP/0.6 Python/3.7.10
< Date: Tue, 08 Jun 2021 07:57:40 GMT
< Content-Type: application/json
< 
{ [data not shown]
* Closing connection 0
{
  "members": [
    {
      "name": "patroni1",
      "role": "leader",
      "state": "running",
      "api_url": "http://172.18.0.111:8008/patroni",
      "host": "172.18.0.111",
      "port": 5432,
      "timeline": 2
    },
    {
      "name": "patroni2",
      "role": "replica",
      "state": "running",
      "api_url": "http://172.18.0.112:8008/patroni",
      "host": "172.18.0.112",
      "port": 5432,
      "lag": 56
    },
    {
      "name": "patroni3",
      "role": "replica",
      "state": "running",
      "api_url": "http://172.18.0.113:8008/patroni",
      "host": "172.18.0.113",
      "port": 5432,
      "lag": 0
    }
  ]
}

The leader has not changed, and it was the leader at shutdown:

2021-06-08 07:39:13,765 INFO: Lock owner: patroni1; I am patroni1
2021-06-08 07:39:13,768 INFO: no action.  i am the leader with the lock
2021-06-08 07:39:23,777 INFO: Lock owner: patroni1; I am patroni1
2021-06-08 07:39:23,799 INFO: no action.  i am the leader with the lock
LOG:  received fast shutdown request
LOG:  aborting any active transactions
FATAL:  terminating connection due to administrator command
LOG:  background worker "logical replication launcher" (PID 350) exited with exit code 1
LOG:  shutting down
LOG:  database system is shut down
LOG:  started streaming WAL from primary at 0/5000000 on timeline 2
FATAL:  could not receive data from WAL stream: ERROR:  requested starting point 0/5000000 is ahead of the WAL flush position of this server 0/4000388
FATAL:  no pg_hba.conf entry for replication connection from host "127.0.0.1", user "gitlab_replicator", SSL on
FATAL:  no pg_hba.conf entry for replication connection from host "127.0.0.1", user "gitlab_replicator", SSL off
2021-06-08 07:46:06,740 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/opt/gitlab/embedded/lib/python3.7/site-packages/patroni/postgresql/__init__.py", line 737, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/opt/gitlab/embedded/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/opt/gitlab/embedded/lib/python3.7/site-packages/patroni/postgresql/__init__.py", line 732, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/opt/gitlab/embedded/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/opt/gitlab/embedded/lib/python3.7/site-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor
    with psycopg2.connect(**kwargs) as conn:
  File "/opt/gitlab/embedded/lib/python3.7/site-packages/psycopg2-2.8.6-py3.7-linux-x86_64.egg/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  no pg_hba.conf entry for replication connection from host "127.0.0.1", user "gitlab_replicator", SSL on
FATAL:  no pg_hba.conf entry for replication connection from host "127.0.0.1", user "gitlab_replicator", SSL off

For Patroni, unlike repmgr, it is not documented that the loopback address is required. This came up in another issue.

Tried adding it to all three:

    - update content in file /var/opt/gitlab/postgresql/data/pg_hba.conf from 94a1d6 to 3a7100
    --- /var/opt/gitlab/postgresql/data/pg_hba.conf	2021-06-06 13:56:20.585654101 +0000
    +++ /var/opt/gitlab/postgresql/data/.chef-pg_hba20210608-2707-1is63n6.conf	2021-06-08 08:10:56.828252334 +0000
    @@ -72,6 +72,8 @@
     
     host    all         all         172.18.0.0/24           trust
     host    replication gitlab_replicator 172.18.0.0/24     trust
    +host    all         all         127.0.0.1/32           trust
    +host    replication gitlab_replicator 127.0.0.1/32     trust
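
pg_hba.conf is evaluated first-match-wins over (database, user, address) entries, and the special `replication` database keyword is not matched by `all`, which is why the original entries never matched the loopback replication connection. A toy matcher sketching this (first_match is a hypothetical simplification, not PostgreSQL's actual logic):

```python
import ipaddress

def first_match(entries, db, user, addr):
    """Toy first-match lookup over simplified (database, user, cidr, method) rows.

    Note: 'replication' is treated as its own database keyword, mirroring
    the real rule that a database column of 'all' does not cover it.
    """
    ip = ipaddress.ip_address(addr)
    for d, u, cidr, method in entries:
        if d == db and u in (user, "all") and ip in ipaddress.ip_network(cidr):
            return method
    return None  # PostgreSQL reports FATAL: no pg_hba.conf entry

before = [
    ("all", "all", "172.18.0.0/24", "trust"),
    ("replication", "gitlab_replicator", "172.18.0.0/24", "trust"),
]
after = before + [
    ("all", "all", "127.0.0.1/32", "trust"),
    ("replication", "gitlab_replicator", "127.0.0.1/32", "trust"),
]

print(first_match(before, "replication", "gitlab_replicator", "127.0.0.1"))  # None
print(first_match(after, "replication", "gitlab_replicator", "127.0.0.1"))   # trust
```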

With these entries added, the traceback and the fatal errors about the loopback are gone.

2021-06-08 08:13:30,525 INFO: Lock owner: patroni1; I am patroni3
2021-06-08 08:13:30,525 INFO: does not have lock
2021-06-08 08:13:30,533 INFO: no action.  i am a secondary and i am following a leader
LOG:  started streaming WAL from primary at 0/5000000 on timeline 2
FATAL:  could not receive data from WAL stream: ERROR:  requested starting point 0/5000000 is ahead of the WAL flush position of this server 0/41803F0
LOG:  started streaming WAL from primary at 0/5000000 on timeline 2
FATAL:  could not receive data from WAL stream: ERROR:  requested starting point 0/5000000 is ahead of the WAL flush position of this server 0/41803F0
2021-06-08 08:13:40,613 INFO: Lock owner: patroni1; I am patroni3
2021-06-08 08:13:40,614 INFO: does not have lock
2021-06-08 08:13:40,627 INFO: no action.  i am a secondary and i am following a leader

What is the current bug behavior?

Is the loopback address required in trust_auth_cidr_addresses when running Patroni?

I thought I did everything in the implementation instructions, but nothing alerted me to the cluster not being synced.

What is the expected correct behavior?

No errors about the loopback.

Also, the cluster build instructions should include a step that would highlight that the cluster isn't actually in sync. If there are no writes to the database, it seems the 'lag' field isn't helpful.

# gitlab-ctl patroni members
+ Cluster: postgresql-ha (6970678148837286213) ------+---------+---------+----+-----------+
| Member                              | Host         | Role    | State   | TL | Lag in MB |
+-------------------------------------+--------------+---------+---------+----+-----------+
| gitlab-testdbc1-patroni1.watertower | 172.18.0.111 | Leader  | running |  2 |           |
| gitlab-testdbc1-patroni2.watertower | 172.18.0.112 | Replica | running |  2 |         0 |
| gitlab-testdbc1-patroni3.watertower | 172.18.0.113 | Replica | running |  2 |         0 |
+-------------------------------------+--------------+---------+---------+----+-----------+

Relevant logs

As above.

Details of package version

# docker ps | egrep 'IMAGE|patroni' | cut -b -50
CONTAINER ID        IMAGE                         
641b49c5075f        gitlab/gitlab-ee:13.12.2-ee.0 
75fc2fbb5b80        gitlab/gitlab-ee:13.12.2-ee.0 
954abe2cf779        gitlab/gitlab-ee:13.12.2-ee.0 

Environment details

gitlab/gitlab-ee:13.12.2-ee.0 container (Ubuntu) running in Docker on CentOS 7.

Configuration details

# grep -v password  patroni*/config/gitlab.rb 
patroni1/config/gitlab.rb:roles ['postgres_role']
patroni1/config/gitlab.rb:
patroni1/config/gitlab.rb:consul['configuration'] = {
patroni1/config/gitlab.rb:  retry_join: ['172.18.0.101:8301','172.18.0.102:8301','172.18.0.103:8301'],
patroni1/config/gitlab.rb:  server: false,
patroni1/config/gitlab.rb:}
patroni1/config/gitlab.rb:
patroni1/config/gitlab.rb:# Enable Patroni (which automatically disables Repmgr).
patroni1/config/gitlab.rb:patroni['enable'] = true
patroni1/config/gitlab.rb:
patroni1/config/gitlab.rb:# PostgreSQL configuration
patroni1/config/gitlab.rb:postgresql['listen_address'] = '0.0.0.0'
patroni1/config/gitlab.rb:postgresql['trust_auth_cidr_addresses'] = ['172.18.0.0/24','127.0.0.1/32']
patroni1/config/gitlab.rb:
patroni1/config/gitlab.rb:# Configure the Consul agent
patroni1/config/gitlab.rb:consul['services'] = ['postgresql']
patroni1/config/gitlab.rb:
patroni1/config/gitlab.rb:
patroni1/config/gitlab.rb:# to prevent replication from using up all of the available database connections
patroni1/config/gitlab.rb:# set to one more than the number of PG servers
patroni1/config/gitlab.rb:patroni['postgresql']['max_wal_senders'] = 4
patroni1/config/gitlab.rb:patroni['postgresql']['max_replication_slots'] = 4
patroni1/config/gitlab.rb:
patroni1/config/gitlab.rb:# disable prometheus
patroni1/config/gitlab.rb:prometheus_monitoring['enable'] = false
patroni1/config/gitlab.rb:# logrotate not needed, sv handles consul
patroni1/config/gitlab.rb:logrotate['enable'] = false
patroni1/config/gitlab.rb:# Disable auto migrations
patroni1/config/gitlab.rb:gitlab_rails['auto_migrate'] = false

#--

patroni2/config/gitlab.rb:roles ['postgres_role']
patroni2/config/gitlab.rb:
patroni2/config/gitlab.rb:consul['configuration'] = {
patroni2/config/gitlab.rb:  retry_join: ['172.18.0.101:8301','172.18.0.102:8301','172.18.0.103:8301'],
patroni2/config/gitlab.rb:  server: false,
patroni2/config/gitlab.rb:}
patroni2/config/gitlab.rb:
patroni2/config/gitlab.rb:# Enable Patroni (which automatically disables Repmgr).
patroni2/config/gitlab.rb:patroni['enable'] = true
patroni2/config/gitlab.rb:
patroni2/config/gitlab.rb:# PostgreSQL configuration
patroni2/config/gitlab.rb:postgresql['listen_address'] = '0.0.0.0'
patroni2/config/gitlab.rb:postgresql['trust_auth_cidr_addresses'] = ['172.18.0.0/24','127.0.0.1/32']
patroni2/config/gitlab.rb:
patroni2/config/gitlab.rb:# Configure the Consul agent
patroni2/config/gitlab.rb:consul['services'] = ['postgresql']
patroni2/config/gitlab.rb:
patroni2/config/gitlab.rb:
patroni2/config/gitlab.rb:# to prevent replication from using up all of the available database connections
patroni2/config/gitlab.rb:# set to one more than the number of PG servers
patroni2/config/gitlab.rb:patroni['postgresql']['max_wal_senders'] = 4
patroni2/config/gitlab.rb:patroni['postgresql']['max_replication_slots'] = 4
patroni2/config/gitlab.rb:
patroni2/config/gitlab.rb:# disable prometheus
patroni2/config/gitlab.rb:prometheus_monitoring['enable'] = false
patroni2/config/gitlab.rb:# logrotate not needed, sv handles consul
patroni2/config/gitlab.rb:logrotate['enable'] = false
patroni2/config/gitlab.rb:# Disable auto migrations
patroni2/config/gitlab.rb:gitlab_rails['auto_migrate'] = false

#--

patroni3/config/gitlab.rb:roles ['postgres_role']
patroni3/config/gitlab.rb:
patroni3/config/gitlab.rb:consul['configuration'] = {
patroni3/config/gitlab.rb:  retry_join: ['172.18.0.101:8301','172.18.0.102:8301','172.18.0.103:8301'],
patroni3/config/gitlab.rb:  server: false,
patroni3/config/gitlab.rb:}
patroni3/config/gitlab.rb:
patroni3/config/gitlab.rb:# Enable Patroni (which automatically disables Repmgr).
patroni3/config/gitlab.rb:patroni['enable'] = true
patroni3/config/gitlab.rb:
patroni3/config/gitlab.rb:# PostgreSQL configuration
patroni3/config/gitlab.rb:postgresql['listen_address'] = '0.0.0.0'
patroni3/config/gitlab.rb:postgresql['trust_auth_cidr_addresses'] = ['172.18.0.0/24','127.0.0.1/32']
patroni3/config/gitlab.rb:
patroni3/config/gitlab.rb:# Configure the Consul agent
patroni3/config/gitlab.rb:consul['services'] = ['postgresql']
patroni3/config/gitlab.rb:
patroni3/config/gitlab.rb:
patroni3/config/gitlab.rb:# to prevent replication from using up all of the available database connections
patroni3/config/gitlab.rb:# set to one more than the number of PG servers
patroni3/config/gitlab.rb:patroni['postgresql']['max_wal_senders'] = 4
patroni3/config/gitlab.rb:patroni['postgresql']['max_replication_slots'] = 4
patroni3/config/gitlab.rb:
patroni3/config/gitlab.rb:# disable prometheus
patroni3/config/gitlab.rb:prometheus_monitoring['enable'] = false
patroni3/config/gitlab.rb:# logrotate not needed, sv handles consul
patroni3/config/gitlab.rb:logrotate['enable'] = false
patroni3/config/gitlab.rb:# Disable auto migrations
patroni3/config/gitlab.rb:gitlab_rails['auto_migrate'] = false

Edited by Ben Prescott (ex-GitLab)