Fix LSN checks in Host#caught_up?/replica_is_up_to_date? for logical replicas

Dylan Griffith requested to merge fix-lsn-check-for-logical-replicas into master

What does this MR do and why?

Previously, these methods relied only on pg_last_wal_replay_lsn() for replicas. On a logical replica, this function cannot be meaningfully compared against an LSN from the primary, because the replica generates its own WAL and its LSNs have forked from the primary's. Instead, we need to use remote_lsn from pg_replication_origin_status, which reports how far logical replication has progressed in terms of the primary's LSNs. This solution was proposed in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23578#note_1396755188 .
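
To make the comparison concrete, here is a minimal Ruby sketch of what "caught up" means once both values are expressed as primary LSNs. It is an illustration only, not the code in this MR: the helper names (`lsn_to_i`, `caught_up_sketch?`, `lag_bytes_sketch`) are hypothetical, and in practice the comparison would more likely be done in SQL (for example with `pg_wal_lsn_diff`). The only fact relied on is the textual LSN format: two hexadecimal numbers separated by a slash, where the first is the high 32 bits and the second the low 32 bits.

```ruby
# Hypothetical helpers for illustration; these names are not from the MR.

# Convert a textual LSN such as "16/B374D848" into an integer byte position.
# The part before the slash is the high 32 bits, the part after the low 32 bits.
def lsn_to_i(lsn)
  high, low = lsn.split('/', 2)
  raise ArgumentError, "invalid LSN: #{lsn.inspect}" unless high && low

  (Integer(high, 16) << 32) | Integer(low, 16)
end

# A replica has caught up once the position it has applied (remote_lsn for a
# logical replica) is at or beyond the primary write location being checked.
def caught_up_sketch?(applied_lsn, primary_write_lsn)
  lsn_to_i(applied_lsn) >= lsn_to_i(primary_write_lsn)
end

# Number of bytes the replica still has to apply (0 when it is caught up).
def lag_bytes_sketch(applied_lsn, primary_write_lsn)
  [lsn_to_i(primary_write_lsn) - lsn_to_i(applied_lsn), 0].max
end
```

The point of the sketch is that both arguments must come from the same LSN timeline. Passing the logical replica's own pg_last_wal_replay_lsn() as `applied_lsn` would compare positions from two unrelated WAL histories, which is the bug this MR addresses.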

NOTE ABOUT FEATURE FLAGS

This part of the codebase is risky to change because it is very low-level. Ideally we would put changes like this behind a feature flag so that we could roll back quickly if they don't work. However, feature flags are stored in Postgres, which makes this a recursive problem, and in the past we've decided not to use feature flags here. An alternative we have tried is environment variables, but they don't really reduce the time to mitigate incidents because updating them still requires a deployment. Additionally, the extra logic they add could itself make bugs more likely.

So the best we can rely on is careful local testing, plus the fact that GSTG and GPRD are similar enough and that we run QA on GSTG before GPRD. We also don't have realistic unit test coverage for this because we don't run replicas in any CI environment, so any regression would need to be caught by higher-level testing such as QA.

Screenshots or screen recordings

Demo of local testing at https://youtu.be/dxyktRJIJG8 . The first 5 minutes are the actual demo of caught_up? working as expected, followed by a minute explaining the code. I then spend the next 15 minutes trying to work out how to get enough replication lag to test replica_is_up_to_date?, so you can skip to the end for spoilers: replica_is_up_to_date? also ends up working as expected.

How to set up and validate locally

  1. Follow https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing_with_service_discovery.md
  2. Set wal_level = logical in <gdk-root>/postgresql/data/replication.conf
  3. Promote postgresql-replica-2 to a logical replica:
    1. gdk stop postgresql-replica-2
    2. Edit postgresql-replica-2/data/postgresql.conf and remove include 'replication.conf' from the end
    3. rm postgresql-replica-2/data/standby.signal
    4. gdk start postgresql-replica-2
    5. Create a publication on the primary with CREATE PUBLICATION logical_replication_1 FOR ALL TABLES;
    6. Create a subscription on postgresql-replica-2 with CREATE SUBSCRIPTION logical_replication_subscription_1 CONNECTION 'host=/Users/dylangriffith/workspace/gitlab-development-kit/postgresql dbname=gitlabhq_development application_name=postgresql_2' PUBLICATION logical_replication_1 WITH (copy_data = false, create_slot = true);
  4. Create a physical replica of postgresql-replica-2:
    mkdir postgresql-replica-4
    pg_basebackup -R -h $(pwd)/postgresql-replica-2 -D $(pwd)/postgresql-replica-4/data -P -U gitlab_replication --wal-method=fetch
  5. Edit your config/database.yml to include all the replicas:
    load_balancing:
      hosts:
        - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql # 1
        - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica # 2
        - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-2 # 3
        - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-4 # 4
  6. Get the Host objects from the Rails console and check each one:
    lb = User.connection.load_balancer
    h = lb.host_list.send(:next_host) # Cycle enough times to assign variables for all of them
    h.host
    lsn = h.primary_write_location
    h.caught_up?(lsn)
    h.replication_lag_size
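
Once a Host object has been collected for each replica, the per-host checks from step 6 can be run side by side. The following is a rough sketch of a console helper for doing that. `replica_report` is a hypothetical name and is not part of GitLab; it only assumes that each object responds to the three methods already used in step 6 (`host`, `caught_up?` and `replication_lag_size`).

```ruby
# Hypothetical console helper; `replica_report` is not part of GitLab.
# `hosts` is any collection of objects that respond to the methods used in
# step 6: #host, #caught_up?(lsn) and #replication_lag_size.
def replica_report(hosts, lsn)
  hosts.map do |h|
    {
      host: h.host,                      # socket path or hostname of the replica
      caught_up: h.caught_up?(lsn),      # has this replica reached `lsn`?
      lag_size: h.replication_lag_size   # lag as reported by the Host itself
    }
  end
end

# Possible usage in the Rails console, assuming `hosts` holds the Host objects
# gathered in step 6 and `lsn` is the primary write location from that step:
#
#   replica_report(hosts, lsn).each { |row| puts row.inspect }
```

With the setup above, the interesting rows are the logical replica (postgresql-replica-2) and the physical replica that streams from it (postgresql-replica-4), since those are the hosts whose own WAL positions are not comparable with the primary's.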

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Dylan Griffith
