Rename postgres-01 to postgres-archive-replica-01 (Was: postgresql-01 production replica differs from other replicas)
The replication lag on postgres-01 (replica) started to grow significantly today: https://prometheus.gprd.gitlab.net/graph?g0.range_input=1h&g0.expr=(pg_replication_lag%20%3E%2043200)%20and%20on(instance)%20(pg_replication_is_replica%7Bfqdn%3D%22postgres-01-db-gprd.c.gitlab-production.internal%22%7D%20%3D%3D%201)&g0.tab=0
@skarbek also pointed out that repmgr isn't running on it.
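The linked Prometheus expression boils down to a simple condition: alert only when the instance is a replica and its lag exceeds 43200 seconds (12 hours). A minimal sketch of that logic (the function name is hypothetical; the metric semantics mirror `pg_replication_lag` and `pg_replication_is_replica` from the linked query):

```python
LAG_THRESHOLD_SECONDS = 43200  # 12 hours, as in the alert expression

def should_alert(lag_seconds: float, is_replica: bool) -> bool:
    """Fire only for replicas whose replication lag exceeds the threshold.

    The `and on(instance)` join in the PromQL filters out non-replicas,
    so a primary reporting a large lag value never triggers the alert.
    """
    return is_replica and lag_seconds > LAG_THRESHOLD_SECONDS

print(should_alert(46800, True))   # replica 13 h behind -> True
print(should_alert(46800, False))  # primary -> False
```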
`recovery.conf` differs from the other replicas: streaming replication is not enabled, and the node uses WAL shipping from S3 instead (perhaps it was left as-is from the Azure->GCP migration, when the postgres-01 node intentionally didn't use SR):
```
$ sudo cat /var/opt/gitlab/postgresql/data/recovery.conf
# recovery file for creating the standby server
# uses both restore_command to fetch wal chunks
# and pimary_conninfo to transition to secondary
# when possible
# Specifies whether to start the PostgreSQL server as a standby.
# If this parameter is on, the server will not stop recovery when the end of archived WAL is reached,
# but will keep trying to continue recovery by fetching new WAL segments using restore_command and/or
# by connecting to the primary server as specified by the primary_conninfo setting.
standby_mode = 'on'
# By default, recovery will recover to the end of the WAL log.
# So we don't need any recovery_* options
# If any option is unspecified in this string, then the corresponding environment variable (see Section 32.14) is checked.
# https://www.postgresql.org/docs/9.6/static/libpq-envars.html
# TL;DR: export PGPASSWORD=XXX
#primary_conninfo = 'user=gitlab_repmgr host=''postgres-01.db.prd.gitlab.com'' password=XXX port=5432 fallback_application_name=repmgr sslmode=prefer sslcompression=1 application_name=''postgres-01.db.gprd.gitlab.com'''
#primary_slot_name = secondary_gprd
# lastly, the restore command that will be run until we can switch
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch -p 32 "%f" "%p"'
recovery_target_timeline='latest'
```
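For comparison, a streaming-replication `recovery.conf` on the other replicas would presumably look like the sketch below, i.e. with the commented-out lines enabled. This is an assumption based on the commented-out `primary_conninfo`/`primary_slot_name` in the dump, not a copy of an actual replica's file; `restore_command` can remain as a fallback used when streaming is interrupted:

```
standby_mode = 'on'
primary_conninfo = 'user=gitlab_repmgr host=''postgres-01.db.prd.gitlab.com'' password=XXX port=5432 fallback_application_name=repmgr sslmode=prefer sslcompression=1 application_name=''postgres-01.db.gprd.gitlab.com'''
primary_slot_name = secondary_gprd
# fallback when streaming is interrupted
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch -p 32 "%f" "%p"'
recovery_target_timeline = 'latest'
```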
Questions:
- why is it still using WAL shipping instead of streaming replication? why is the setup not symmetric with the other replicas?
- why is it lagging? (higher lag is expected because fetching WAL from S3 is slower and less reliable than SR, but this time the lag is too high)
- do we really need 5 replicas? if yes, what are the reasons for that?
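To audit the asymmetry across the fleet, each replica's `recovery.conf` can be classified by whether streaming replication is actually configured. A minimal sketch (the helper and its parsing logic are assumptions, not an existing tool; it only checks for an uncommented `primary_conninfo`):

```python
def replication_mode(recovery_conf_text: str) -> str:
    """Classify a recovery.conf: 'streaming' if primary_conninfo is set
    (uncommented), otherwise 'wal_shipping' if only restore_command is set."""
    has_conninfo = False
    has_restore = False
    for line in recovery_conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comments, e.g. a commented-out primary_conninfo
        if stripped.startswith("primary_conninfo"):
            has_conninfo = True
        if stripped.startswith("restore_command"):
            has_restore = True
    if has_conninfo:
        return "streaming"
    return "wal_shipping" if has_restore else "unknown"

# The postgres-01 dump above has primary_conninfo commented out:
conf = "standby_mode = 'on'\n#primary_conninfo = '...'\nrestore_command = 'wal-e ...'\n"
print(replication_mode(conf))  # wal_shipping
```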
Edited by Nikolay Samokhvalov