Change geo health check to work for Logical Replication setups

What does this MR do and why?

This MR updates the Geo health check to recognize PostgreSQL logical replication as a valid replication method when the geo_postgresql_replication_agnostic feature flag is enabled.

Solution

Added a logical_replication_mode? method that returns true when:

  • geo_postgresql_replication_agnostic feature flag is enabled, AND
  • Database is not in recovery mode (pg_is_in_recovery() = false)

When in logical replication mode, the health check:

  • Skips the "database is writable" error (this is expected for logical replication)
  • Checks for pending migrations and reports them (since migrations must be run manually on secondaries with logical replication)
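
The branching described above can be sketched in plain Ruby. This is a minimal sketch, not the actual implementation: `feature_enabled`, `in_recovery`, `writable`, and `pending_migration_dbs` are stand-ins for the real calls (`Feature.enabled?(:geo_postgresql_replication_agnostic)`, `SELECT pg_is_in_recovery()`, and the migration status of each database).

```ruby
# Minimal sketch of the health-check branching; inputs are stand-ins
# for the real GitLab/PostgreSQL calls.
def logical_replication_mode?(feature_enabled:, in_recovery:)
  # Logical replication mode: flag is on AND the database is not a
  # streaming-replication standby (pg_is_in_recovery() = false).
  feature_enabled && !in_recovery
end

def perform_checks(feature_enabled:, in_recovery:, writable:, pending_migration_dbs: [])
  if logical_replication_mode?(feature_enabled: feature_enabled, in_recovery: in_recovery)
    # A writable database is expected under logical replication, so skip
    # that error; instead report databases whose migrations are behind,
    # since migrations must be run manually on the secondary.
    return '' if pending_migration_dbs.empty?

    "Databases have pending migrations: #{pending_migration_dbs.join(', ')}.\n" \
      "You may have to run `gitlab-rake db:migrate` on the secondary."
  elsif writable
    'Geo node has a database that is writable which is an indication ' \
      'it is not configured for replication with the primary node.'
  else
    ''
  end
end
```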

Dashboard Visibility

These errors are only partially visible in the Geo Sites dashboard:

  • health_status: The dashboard displays a "Healthy" or "Unhealthy" badge via the GeoSiteHealthStatus component. This is visible in the site header.
  • health (error message): The actual error message text (e.g., "Databases have pending migrations: main") is exposed via the API (GET /api/v4/geo_sites) in the health field, but the frontend does not currently display it. The Vue components only use healthStatus for the badge, not the detailed health message.
  • outdated?: If Geo::MetricsUpdateWorker hasn't run in 60+ minutes, the status is marked unhealthy due to being stale, regardless of the health check result.

API Response Example

When unhealthy, the API returns:

{
  "healthy": false,
  "health": "Databases have pending migrations: main.\nYou may have to run `gitlab-rake db:migrate` on the secondary.",
  "health_status": "Unhealthy"
}
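
Since the frontend only renders the badge, the detailed message has to be read from the API response itself. A small sketch of doing that with Ruby's standard JSON library, using the example body above (in practice the body would come from GET /api/v4/geo_sites with an authenticated request):

```ruby
require 'json'

# Example response body copied from above; a single-quoted heredoc keeps
# the \n escape literal so the string remains valid JSON.
body = <<~'JSON'
  {
    "healthy": false,
    "health": "Databases have pending migrations: main.\nYou may have to run `gitlab-rake db:migrate` on the secondary.",
    "health_status": "Unhealthy"
  }
JSON

site = JSON.parse(body)

# The dashboard badge is driven by health_status only; the detailed
# message lives in the "health" field and is currently API-only.
warn site['health'] unless site['healthy']
```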

References

Relates to #593326

How to set up and validate locally

Via Rails console (bin/rails console) or runner:

Check if logical replication mode is detected

Gitlab::Geo::HealthCheck.new.logical_replication_mode?

Check which databases have pending migrations

Gitlab::Geo::HealthCheck.new.pending_migration_databases

Run the full health check (returns error message or empty string if healthy)

Gitlab::Geo::HealthCheck.new.perform_checks

Via rake task (shows all Geo checks including the new migration check)

bundle exec rake gitlab:geo:check

The real tests can be a bit more involved: I created migrations, rolled them back, and broke configurations, among other things. Here's the list of test cases I ran with Duo's help.

| Test | Description | Result |
|------|-------------|--------|
| Test 1 | Feature Flag Disabled | PASSED |
| Test 2 | Pending Migrations on Secondary | PASSED |
| Test 3 | Subscription Disabled | PASSED |
| Test 5 | High Replication Lag | PASSED |
| Test 6 | Stale Status Record | PASSED |
| Test 7 | Geo Database Migration Mismatch | PASSED |
| Test 8 | Tracking Database Not Configured | PASSED |
| Test 9 | Schema Structure Mismatch | PASSED |

Test Details

Test 1: Feature Flag Disabled

Scenario: Disable the geo_postgresql_replication_agnostic feature flag while running in logical replication mode (DB is writable).

Steps:

Feature.disable(:geo_postgresql_replication_agnostic)
Gitlab::Geo::HealthCheck.new.perform_checks

Expected: Health check returns error about writable database.

Result: Returned: "Geo node has a database that is writable which is an indication it is not configured for replication with the primary node."


Test 2: Pending Migrations on Secondary

Scenario: Primary has migrations that secondary doesn't have (simulates logical replication where migrations must be run manually on secondary).

Steps:

  1. Created a fake future migration file db/schema_migrations/20990101000000 on secondary
  2. Ran health check on secondary

Expected: Health check detects pending migrations and reports which databases need migration.

Result: Returned: "Databases have pending migrations: main.\nYou may have to run `gitlab-rake db:migrate` on the secondary."


Test 3: Subscription Disabled

Scenario: Logical replication subscription is disabled.

Steps:

ALTER SUBSCRIPTION gitlab_geo_sub DISABLE;

Expected: db_replication_lag_seconds returns nil, dashboard shows unhealthy status.

Result: Replication lag query returned no active subscription data.


Test 5: High Replication Lag

Scenario: Secondary falls behind primary due to paused replication.

Steps:

  1. Disabled subscription on secondary
  2. Made writes on primary
  3. Re-enabled subscription
  4. Checked replication lag

Expected: db_replication_lag_seconds returns high value.

Result: Lag was correctly reported based on pg_stat_subscription.latest_end_time.
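
The lag arithmetic this test exercises can be sketched as follows. This is a simplified stand-in, not the HealthCheck code: `latest_end_time` represents the value read from pg_stat_subscription.latest_end_time, and a nil input models the disabled-subscription case from Test 3.

```ruby
# Sketch: lag is "now" minus the last time the subscription worker
# confirmed data from the publisher (pg_stat_subscription.latest_end_time).
# Returns nil when there is no active subscription row.
def replication_lag_seconds(latest_end_time, now: Time.now)
  return nil if latest_end_time.nil? # subscription disabled, no worker row

  (now - latest_end_time).round
end
```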


Test 6: Stale Status Record

Scenario: Geo::MetricsUpdateWorker hasn't run recently.

Steps:

  1. Checked GeoNodeStatus#outdated? logic
  2. Verified status records older than 10 minutes are flagged

Expected: GeoNodeStatus#outdated? returns true for stale records.

Result: Status records are correctly identified as outdated when updated_at is older than threshold.
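
The staleness rule verified here is just a timestamp comparison; a sketch with the threshold left as a parameter (this test uses 10 minutes, while the dashboard section above mentions a 60-minute window, so no single constant is assumed):

```ruby
# Sketch of GeoNodeStatus#outdated?: a status record is stale when
# Geo::MetricsUpdateWorker last refreshed it longer ago than the threshold.
def outdated?(updated_at, threshold_seconds:, now: Time.now)
  (now - updated_at) > threshold_seconds
end
```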


Test 7: Geo Database Migration Mismatch

Scenario: Geo tracking database schema version doesn't match latest migration files.

Steps:

  1. Created fake future geo migration ee/db/geo/migrate/20990101000000_test_geo_migration_mismatch.rb
  2. Ran health check on secondary

Expected: Health check reports geo database version mismatch.

Result: Returned: "Geo database version (20260316212616) does not match latest migration (20990101000000).\nYou may have to run `gitlab-rake db:migrate:geo` as root on the secondary."


Test 8: Tracking Database Not Configured

Scenario: The geo: section is missing from config/database.yml.

Steps:

  1. Renamed geo: to geo_disabled: in config/database.yml
  2. Attempted to run Rails

Expected: Health check returns "Geo database configuration file is missing."

Result: Rails fails to boot entirely when the geo database is not configured. The "Geo database configuration file is missing." message would only appear if Rails could boot with geo unconfigured; in practice, the missing config prevents Rails initialization, so this more severe failure mode is caught before the health check even runs.


Test 9: Schema Structure Mismatch

Scenario: Primary and secondary have different table schemas (column added on primary but not secondary).

Steps:

  1. Added test column on primary:

    ALTER TABLE users ADD COLUMN test_schema_mismatch_col INTEGER;
  2. Triggered replication with an update:

    UPDATE users SET test_schema_mismatch_col = 1 WHERE id = 1;
  3. Checked subscription stats on secondary:

    SELECT subname, apply_error_count, sync_error_count FROM pg_stat_subscription_stats;

Expected: Logical replication errors on affected tables, apply_error_count increases.

Result: apply_error_count increased to 159, confirming schema mismatch causes replication apply errors. This validates that pg_stat_subscription_stats can be used to detect schema drift between primary and secondary.
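
The detection idea this test validates can be reduced to comparing two samples of the subscription stats. A sketch, where the hashes stand in for rows returned by the pg_stat_subscription_stats query above:

```ruby
# Sketch: sample pg_stat_subscription_stats before and after, and treat a
# rising apply_error_count as a sign of schema drift between primary and
# secondary (as observed in this test, where it climbed to 159).
def schema_drift_suspected?(before_row, after_row)
  after_row.fetch('apply_error_count') > before_row.fetch('apply_error_count')
end
```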

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Natanael Silva
