Skip to content

Fix LSN Host#caught_up?/replica_is_up_to_date? for logical replicas

Matt Kasa requested to merge lsn-replica-lag-fix into master

What does this MR do and why?

This is a followup reimplemenation of !121208 (merged) which accounts for the issue we saw in gitlab-com/gl-infra/production#14504 (closed) .

Fix LSN Host#caught_up?/replica_is_up_to_date? for logical replicas, with compatibility for self-managed instances where the database user does not have access to query replication origins.

These methods previously only relied on pg_last_wal_replay_lsn() for replicas. In a logical replica this function cannot be meaningfully compared against the LSN from the primary because they have forked. As such we need to make use of remote_lsn from pg_replication_origin_status which provides a way to determine where the logical replication is up to with respect to LSNs from the primary. This solution was proposed in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23578#note_1396755188 .

This also implements the suggestion proposed in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23578#note_1402654040 to check permissions to query replication origins, and fall back to the old query when we do not have sufficient access to retrieve the remote_lsn.

NOTE ABOUT FEATURE FLAGS

This part of the codebase is very risky to work with as it is a very low level part of the code. Ideally we could feature flag changes like this in order to quickly rollback if it doesn't work. However since the feature flags are stored in Postgres this is very much a recursive problem and in the past we've decided not to try and use feature flags in here. An alternative we have tried in the past is to use environment variables. But environment variables don't really speed up time to mitigate incidents anyway as they still require deployments to update them. Additionally since they add extra logic they could also make bugs more likely.

So the best we can rely on is careful local testing and the fact that GSTG and GPRD are similar enough and we run QA on this on GSTG before GPRD. We also don't have realistic unit test coverage for this because we don't run replicas in any CI environment. This would again need to be caught in higher level testing situations like QA.

Screenshots or screen recordings

How to set up and validate locally

Very long setup instructions
  1. Follow https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing.md
  2. Update wal_level = logical in <gdk-root>/postgresql/data/replication.conf
  3. Create a non super user for testing on the postgresql primary:
    1. create role gitlab login;
    2. GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public to gitlab;
    3. GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO gitlab;
  4. Create a new postgresql-replica-pg14-leader
    1. gdk stop
    2. mkdir pg14 && cd pg14 && asdf local postgres 14.8
    3. mkdir -p ../postgresql-replica-pg14-leader/data
    4. initdb --locale=C -E utf-8 ../postgresql-replica-pg14-leader/data
    5. pg_upgrade -b /Users/dylangriffith/.asdf/installs/postgres/13.9/bin -B /Users/dylangriffith/.asdf/installs/postgres/14.8/bin -d ../postgresql/data -D ../postgresql-replica-pg14-leader/data
    6. postgres -D /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-leader/data -k /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-leader -h ''
  5. Make postgresql-replica-pg14-leader a logical replica:
    1. Create a publication on the primary with CREATE PUBLICATION logical_replication_1 FOR ALL TABLES;
    2. Create a subscription on gdk psql -h /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-leader/ with CREATE SUBSCRIPTION logical_replication_subscription_1 CONNECTION 'host=/Users/dylangriffith/workspace/gitlab-development-kit/postgresql dbname=gitlabhq_development application_name=postgresql_replica_pg14_leader' PUBLICATION logical_replication_1 WITH (copy_data = false, create_slot = true);
  6. Create a physical replica postgresql-replica-pg14-replica of postgresql-replica-pg14-leader:
    cd pg14
    mkdir ../postgresql-replica-pg14-replica
    pg_basebackup -R -h  /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-leader -D /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-replica/data -P -U gitlab_replication --wal-method=fetch
  7. Add permissions to read LSN views/functions in pg14:
    1. cd pg14
    2. psql -h /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-leader/ -d gitlabhq_development
    3. create role gitlab login;
    4. grant select on table pg_replication_origin_status to gitlab;
    5. grant execute on function pg_show_replication_origin_status to gitlab;
    6. select remote_lsn from pg_replication_origin_status;
  8. Edit your config/database.yml to have all the replicas in the main config only (gitlab user and replication are only setup for main)
    username: gitlab
    load_balancing:
      hosts:
        - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql # 1
        - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica # 2
        - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-leader # 3
        - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-pg14-replica # 4
  9. Get the Host objects from the rails console
    lb = User.connection.load_balancer
    
    primary = lb.host_list.hosts.find { |h| File.basename(h.host) == "postgresql" }
    replica = lb.host_list.hosts.find { |h| File.basename(h.host) == "postgresql-replica" }
    logical_leader = lb.host_list.hosts.find { |h| File.basename(h.host) == "postgresql-replica-pg14-leader" }
    logical_replica = lb.host_list.hosts.find { |h| File.basename(h.host) == "postgresql-replica-pg14-replica" }
    
    lsn = primary.primary_write_location
    
    { 1 => primary.caught_up?(lsn), 2 => replica.caught_up?(lsn), 3 => logical_leader.caught_up?(lsn), 4 => logical_replica.caught_up?(lsn) }

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Dylan Griffith

Merge request reports