
Add ops feature flags to control load balancer replication_lag_time

Matt Kasa requested to merge stomlinson/load-balancer-ff-force-replicas into master

What does this MR do and why?

This MR adds two ops feature flags that let the application database load balancer use replicas it would normally skip because their replication lag time exceeds max_replication_lag_time. One flag doubles max_replication_lag_time; the other ignores it completely.

The flags are intended as an emergency measure to prevent an outage when the replicas cannot keep up with the WAL rate and the primary, left without usable replicas, becomes saturated.

  • load_balancer_double_replication_lag_time should be tried first.
  • load_balancer_ignore_replication_lag_time should be a last resort.
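
The effect of the two flags on the lag check can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: the helper names effective_max_lag and replica_usable?, the 60-second default, the use of <= for the comparison, and the rule that the ignore flag wins when both flags are enabled are all assumptions made for the example. Only the two flag names come from the description above.

```ruby
# Hypothetical sketch of how the two ops flags could change the lag check.
# Not GitLab's implementation; names and defaults below are assumptions.

# Assumed default for max_replication_lag_time, in seconds.
DEFAULT_MAX_REPLICATION_LAG_TIME = 60.0

# Returns the lag threshold after applying the enabled flags.
# `flags` is any collection of enabled flag names (symbols).
def effective_max_lag(max_lag, flags)
  # Assumption: ignoring the limit takes precedence over doubling it.
  return Float::INFINITY if flags.include?(:load_balancer_ignore_replication_lag_time)
  return max_lag * 2 if flags.include?(:load_balancer_double_replication_lag_time)

  max_lag
end

# A replica is treated as usable when its lag is within the effective limit.
def replica_usable?(lag_seconds, max_lag: DEFAULT_MAX_REPLICATION_LAG_TIME, flags: [])
  lag_seconds <= effective_max_lag(max_lag, flags)
end
```

Under these assumptions, a replica lagging by 90 seconds is rejected with no flags enabled, accepted once the double flag is on, and any finite lag is accepted with the ignore flag on.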

Relates to: https://gitlab.com/gitlab-org/gitlab/-/issues/429935

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

  1. Configure GDK with database load balancing and at least one replica (see https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing.md).
  2. Using the rails console, perform a read query and verify that it used the replica.
  3. Simulate 90 seconds of lag (set recovery_min_apply_delay = '90s' in the replica's postgresql.conf).
  4. Using the rails console, perform a read query and verify that it used the primary.
  5. Enable the load_balancer_double_replication_lag_time feature flag.
  6. Using the rails console, perform a read query and verify that it used the replica even though the replica is lagging.

The same test can be performed with more than 120s of lag using the load_balancer_ignore_replication_lag_time feature flag.
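
The outcomes expected from these validation steps can be written down as a small standalone script. It is only a model of the expected results, not a substitute for running the steps against GDK. It assumes max_replication_lag_time is 60 seconds (consistent with 90 seconds of lag falling back to the primary and 120 seconds being the doubled limit); MAX_LAG, read_host, and the :none/:double/:ignore labels are hypothetical names used for illustration.

```ruby
# Standalone model of the expected validation results.
# Assumption: max_replication_lag_time is 60 seconds.
MAX_LAG = 60

# Hypothetical helper: which host should serve a read for a given
# replica lag (seconds) and flag state (:none, :double, or :ignore).
def read_host(lag, flag)
  limit = case flag
          when :double then MAX_LAG * 2      # load_balancer_double_replication_lag_time
          when :ignore then Float::INFINITY  # load_balancer_ignore_replication_lag_time
          else MAX_LAG                       # no flag enabled
          end
  lag <= limit ? :replica : :primary
end

# Scenarios mirroring the steps: no lag, 90s of lag, then more than 120s.
scenarios = [[0, :none], [90, :none], [90, :double], [150, :double], [150, :ignore]]

scenarios.each do |lag, flag|
  puts format("lag=%3ds flag=%-6s -> %s", lag, flag, read_host(lag, flag))
end
```

The 150-second rows correspond to the second test: with that much lag the double flag is no longer enough, and only the ignore flag routes reads back to the replica.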

Edited by Matt Kasa
