Fix LSN Host#caught_up?/replica_is_up_to_date? for logical replicas
What does this MR do and why?
This is a followup reimplemenation of !121208 (merged) which accounts for the issue we saw in gitlab-com/gl-infra/production#14504 (closed) .
Fix LSN Host#caught_up?/replica_is_up_to_date?
for logical replicas, with compatibility for self-managed instances where the database user does not have access to query replication origins.
These methods previously only relied on pg_last_wal_replay_lsn()
for
replicas. In a logical replica this function cannot be meaningfully
compared against the LSN from the primary because they have forked. As
such we need to make use of
remote_lsn from pg_replication_origin_status
which provides a way to
determine where the logical replication is up to with respect to LSNs
from the primary. This solution was proposed in
https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23578#note_1396755188
.
This also implements the suggestion proposed in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23578#note_1402654040 to check permissions to query replication origins, and fall back to the old query when we do not have sufficient access to retrieve the remote_lsn
.
NOTE ABOUT FEATURE FLAGS
This part of the codebase is very risky to work with as it is a very low level part of the code. Ideally we could feature flag changes like this in order to quickly rollback if it doesn't work. However since the feature flags are stored in Postgres this is very much a recursive problem and in the past we've decided not to try and use feature flags in here. An alternative we have tried in the past is to use environment variables. But environment variables don't really speed up time to mitigate incidents anyway as they still require deployments to update them. Additionally since they add extra logic they could also make bugs more likely.
So the best we can rely on is careful local testing and the fact that GSTG and GPRD are similar enough and we run QA on this on GSTG before GPRD. We also don't have realistic unit test coverage for this because we don't run replicas in any CI environment. This would again need to be caught in higher level testing situations like QA.
Screenshots or screen recordings
How to set up and validate locally
Very long setup instructions
- Follow https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing.md
- Update
wal_level = logical
in<gdk-root>/postgresql/data/replication.conf
- Create a non super user for testing on the
postgresql
primary:create role gitlab login;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public to gitlab;
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO gitlab;
- Create a new
postgresql-replica-pg14-leader
gdk stop
mkdir pg14 && cd pg14 && asdf local postgres 14.8
mkdir -p ../postgresql-replica-pg14-leader/data
initdb --locale=C -E utf-8 ../postgresql-replica-pg14-leader/data
pg_upgrade -b /Users/dylangriffith/.asdf/installs/postgres/13.9/bin -B /Users/dylangriffith/.asdf/installs/postgres/14.8/bin -d ../postgresql/data -D ../postgresql-replica-pg14-leader/data
postgres -D /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-leader/data -k /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-leader -h ''
- Make
postgresql-replica-pg14-leader
a logical replica:- Create a publication on the primary with
CREATE PUBLICATION logical_replication_1 FOR ALL TABLES;
- Create a subscription on
gdk psql -h /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-leader/
withCREATE SUBSCRIPTION logical_replication_subscription_1 CONNECTION 'host=/Users/dylangriffith/workspace/gitlab-development-kit/postgresql dbname=gitlabhq_development application_name=postgresql_replica_pg14_leader' PUBLICATION logical_replication_1 WITH (copy_data = false, create_slot = true);
- Create a publication on the primary with
- Create a physical replica
postgresql-replica-pg14-replica
ofpostgresql-replica-pg14-leader
:cd pg14 mkdir ../postgresql-replica-pg14-replica pg_basebackup -R -h /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-leader -D /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-replica/data -P -U gitlab_replication --wal-method=fetch
- Add permissions to read LSN views/functions in pg14:
cd pg14
psql -h /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-leader/ -d gitlabhq_development
create role gitlab login;
grant select on table pg_replication_origin_status to gitlab;
grant execute on function pg_show_replication_origin_status to gitlab;
select remote_lsn from pg_replication_origin_status;
- Edit your
config/database.yml
to have all the replicas in themain
config only (gitlab user and replication are only setup for main)username: gitlab load_balancing: hosts: - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql # 1 - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica # 2 - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-leader # 3 - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-replica # 4
- Get the
Host
objects from the rails consolelb = User.connection.load_balancer primary = lb.host_list.hosts.find { |h| File.basename(h.host) == "postgresql" } replica = lb.host_list.hosts.find { |h| File.basename(h.host) == "postgresql-replica" } logical_leader = lb.host_list.hosts.find { |h| File.basename(h.host) == "postgresql-replica-pg14-leader" } logical_replica = lb.host_list.hosts.find { |h| File.basename(h.host) == "postgresql-replica-pg14-replica" } lsn = primary.primary_write_location { 1 => primary.caught_up?(lsn), 2 => replica.caught_up?(lsn), 3 => logical_leader.caught_up?(lsn), 4 => logical_replica.caught_up?(lsn) }
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.