Fix LSN `Host#caught_up?`/`replica_is_up_to_date?` for logical replicas
## What does this MR do and why?
These methods previously relied only on `pg_last_wal_replay_lsn()` for replicas. On a logical replica this function cannot be meaningfully compared against an LSN from the primary because the two servers have diverged onto separate timelines. Instead we need to use `remote_lsn` from `pg_replication_origin_status`, which tells us how far logical replication has progressed with respect to the primary's LSNs. This solution was proposed in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23578#note_1396755188.
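To illustrate the comparison, here is a minimal Ruby sketch. The names `lsn_to_i` and `caught_up?` are illustrative, not the actual `Host` implementation: PostgreSQL LSNs are textual `XXXXXXXX/YYYYYYYY` values, and a replica counts as caught up once the LSN it has replayed (`remote_lsn` on a logical replica) is at least the primary's write LSN.

```ruby
# Convert PostgreSQL's textual "XXXXXXXX/YYYYYYYY" LSN into a 64-bit
# integer so two LSNs can be compared numerically. Both halves are hex.
def lsn_to_i(lsn)
  hi, lo = lsn.split('/')
  (hi.to_i(16) << 32) | lo.to_i(16)
end

# A replica is caught up to `primary_lsn` once its replayed LSN
# (remote_lsn for a logical replica) is at least as large.
def caught_up?(replica_lsn, primary_lsn)
  lsn_to_i(replica_lsn) >= lsn_to_i(primary_lsn)
end

puts caught_up?('0/16B3748', '0/16B3740') # => true: replica has replayed past the primary LSN
puts caught_up?('0/16B3740', '1/0')       # => false: primary is far ahead
```

The actual code may well delegate this comparison to Postgres itself (for example via `pg_wal_lsn_diff`); the integer form above is only meant to show why `remote_lsn`, rather than `pg_last_wal_replay_lsn()`, is the value that can be compared against the primary's LSN.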
## Note about feature flags
This part of the codebase is risky to work with because it is very low level. Ideally we would feature flag changes like this so we could quickly roll back if something doesn't work. However, since feature flags are stored in Postgres, checking a flag here is a recursive problem, and in the past we've decided not to use feature flags in this code. An alternative we have tried before is environment variables, but those don't really speed up incident mitigation since updating them still requires a deployment, and the extra logic they add could itself make bugs more likely.
So the best we can rely on is careful local testing, plus the fact that GSTG and GPRD are similar enough and we run QA on GSTG before GPRD. We also don't have realistic unit test coverage for this because we don't run replicas in any CI environment, so regressions would again need to be caught in higher-level testing such as QA.
## Screenshots or screen recordings
Demo of local testing at https://youtu.be/dxyktRJIJG8. The first 5 minutes are the actual demo of `caught_up?` working as expected, then I spend a minute explaining the code, and the next 15 minutes trying to work out how to generate enough replication lag to test `replica_is_up_to_date?`. You could skip to the end for spoilers: `replica_is_up_to_date?` also ends up working as expected.
## How to set up and validate locally
- Follow https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing_with_service_discovery.md
- Update `wal_level = logical` in `<gdk-root>/postgresql/data/replication.conf`
- Promote `postgresql-replica-2` to a logical replica:

  ```shell
  gdk stop postgresql-replica-2
  # Edit postgresql-replica-2/data/postgresql.conf and remove the trailing
  # `include 'replication.conf'`
  rm postgresql-replica-2/data/standby.signal
  gdk start postgresql-replica-2
  ```
- Create a publication on the primary with `CREATE PUBLICATION logical_replication_1 FOR ALL TABLES;`
- Create a subscription on `postgresql-replica-2` with:

  ```sql
  CREATE SUBSCRIPTION logical_replication_subscription_1
    CONNECTION 'host=/Users/dylangriffith/workspace/gitlab-development-kit/postgresql dbname=gitlabhq_development application_name=postgresql_2'
    PUBLICATION logical_replication_1
    WITH (copy_data = false, create_slot = true);
  ```
- Create a physical replica of `postgresql-replica-2`:

  ```shell
  mkdir postgresql-replica-4
  pg_basebackup -R -h $(pwd)/postgresql-replica-2 -D $(pwd)/postgresql-replica-4/data -P -U gitlab_replication --wal-method=fetch
  ```
- Edit your `config/database.yml` to have all the replicas:

  ```yaml
  load_balancing:
    hosts:
      - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql # 1
      - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica # 2
      - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-2 # 3
      - /Users/dylangriffith/workspace/gitlab-development-kit/postgresql-replica-4 # 4
  ```
- Get the `Host` objects from the rails console:

  ```ruby
  lb = User.connection.load_balancer
  h = lb.host_list.send(:next_host) # Cycle enough times to assign variables for all of them
  h.host
  lsn = h.primary_write_location
  h.caught_up?(lsn)
  h.replication_lag_size
  ```
## MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- [x] I have evaluated the MR acceptance checklist for this MR.