Change geo health check to work for Logical Replication setups
What does this MR do and why?
This MR updates the Geo health check to recognize PostgreSQL logical replication as a valid replication method when the geo_postgresql_replication_agnostic feature flag is enabled.
Solution
Added a logical_replication_mode? method that returns true when:
- The geo_postgresql_replication_agnostic feature flag is enabled, AND
- The database is not in recovery mode (pg_is_in_recovery() = false)
When in logical replication mode, the health check:
- Skips the "database is writable" error (this is expected for logical replication)
- Checks for pending migrations and reports them (since migrations must be run manually on secondaries with logical replication)
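The decision flow described above can be sketched in plain Ruby. This is a minimal illustration, not the actual Gitlab::Geo::HealthCheck implementation: the class name SketchHealthCheck and its constructor arguments are hypothetical stand-ins for the feature flag lookup, the pg_is_in_recovery() query, and the pending-migration lookup. The real health check runs other checks that are omitted here.

```ruby
# Hypothetical stand-in for Gitlab::Geo::HealthCheck; inputs are injected
# instead of being read from the feature flag store or the database.
class SketchHealthCheck
  WRITABLE_ERROR = 'Geo node has a database that is writable which is an indication ' \
                   'it is not configured for replication with the primary node.'

  def initialize(flag_enabled:, in_recovery:, pending_databases: [])
    @flag_enabled = flag_enabled           # geo_postgresql_replication_agnostic state
    @in_recovery = in_recovery             # result of pg_is_in_recovery()
    @pending_databases = pending_databases # e.g. ['main']
  end

  # True only when the flag is on AND the database is writable (not in recovery).
  def logical_replication_mode?
    @flag_enabled && !@in_recovery
  end

  # Returns an error message, or an empty string when healthy.
  def perform_checks
    if logical_replication_mode?
      pending_migrations_message
    elsif !@in_recovery
      WRITABLE_ERROR
    else
      ''
    end
  end

  private

  def pending_migrations_message
    return '' if @pending_databases.empty?

    "Databases have pending migrations: #{@pending_databases.join(', ')}.\n" \
      'You may have to run `gitlab-rake db:migrate` on the secondary.'
  end
end

check = SketchHealthCheck.new(flag_enabled: true, in_recovery: false, pending_databases: ['main'])
puts check.perform_checks
```

With the flag disabled and the same writable database, the sketch falls back to the writable-database error, which mirrors the behaviour exercised in Test 1.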
Dashboard Visibility
These errors are only partially visible in the Geo Sites dashboard:
- health_status: The dashboard displays a "Healthy" or "Unhealthy" badge via the GeoSiteHealthStatus component. This is visible in the site header.
- health (error message): The actual error message text (e.g., "Databases have pending migrations: main") is exposed via the API (GET /api/v4/geo_sites) in the health field, but the frontend does not currently display it. The Vue components only use healthStatus for the badge, not the detailed health message.
- outdated?: If Geo::MetricsUpdateWorker hasn't run in 60+ minutes, the status is marked unhealthy due to being stale, regardless of the health check result.
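Because the dashboard shows only the badge, the detailed message has to be read from the API payload. The following is a rough sketch of how a script might surface it. The helper name summarize_health is hypothetical, and the sketch assumes a single JSON object carrying the healthy, health, and health_status fields described in this MR; fetching the payload is left out.

```ruby
require 'json'

# Hypothetical helper: turns one status payload into a one-line summary.
def summarize_health(payload)
  status = JSON.parse(payload)
  return 'Healthy' if status['healthy']

  # The `health` field carries the detail that the dashboard badge omits.
  "#{status['health_status']}: #{status['health']}"
end

sample = <<~PAYLOAD
  {
    "healthy": false,
    "health": "Databases have pending migrations: main.",
    "health_status": "Unhealthy"
  }
PAYLOAD

puts summarize_health(sample)
```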
API Response Example
When unhealthy, the API returns:
{
"healthy": false,
"health": "Databases have pending migrations: main.\nYou may have to run `gitlab-rake db:migrate` on the secondary.",
"health_status": "Unhealthy"
}
References
Relates to #593326
How to set up and validate locally
Via Rails console (bin/rails console) or runner:
Check if logical replication mode is detected:
Gitlab::Geo::HealthCheck.new.logical_replication_mode?
Check which databases have pending migrations:
Gitlab::Geo::HealthCheck.new.pending_migration_databases
Run the full health check (returns an error message, or an empty string if healthy):
Gitlab::Geo::HealthCheck.new.perform_checks
Via rake task (shows all Geo checks, including the new migration check):
bundle exec rake gitlab:geo:check
The real tests were a bit more involved: I created migrations, rolled them back, and broke configurations, among other things. Here is the list of test cases I ran with Duo's help.
| Test | Description | Result |
|---|---|---|
| Test 1 | Feature Flag Disabled | |
| Test 2 | Pending Migrations on Secondary | |
| Test 3 | Subscription Disabled | |
| Test 5 | High Replication Lag | |
| Test 6 | Stale Status Record | |
| Test 7 | Geo Database Migration Mismatch | |
| Test 8 | Tracking Database Not Configured | |
| Test 9 | Schema Structure Mismatch | |
Test Details
Test 1: Feature Flag Disabled
Scenario: Disable the geo_postgresql_replication_agnostic feature flag while running in logical replication mode (DB is writable).
Steps:
Feature.disable(:geo_postgresql_replication_agnostic)
Gitlab::Geo::HealthCheck.new.perform_checks
Expected: Health check returns error about writable database.
Result: "Geo node has a database that is writable which is an indication it is not configured for replication with the primary node."
Test 2: Pending Migrations on Secondary
Scenario: Primary has migrations that secondary doesn't have (simulates logical replication where migrations must be run manually on secondary).
Steps:
- Created a fake future migration file db/schema_migrations/20990101000000 on secondary
- Ran health check on secondary
Expected: Health check detects pending migrations and reports which databases need migration.
Result: "Databases have pending migrations: main.\nYou may have to run `gitlab-rake db:migrate` on the secondary."
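To illustrate the idea behind the per-database report, here is a minimal sketch that treats a database as pending when a known migration version is missing from its applied set. The function name databases_with_pending_migrations and the sample versions are hypothetical, and this is not necessarily how the actual health check discovers pending migrations.

```ruby
# Hypothetical: known_versions are migration versions found on disk,
# applied_by_db maps each database name to the versions already applied.
def databases_with_pending_migrations(known_versions, applied_by_db)
  applied_by_db.each_with_object([]) do |(db, applied), pending|
    pending << db unless (known_versions - applied).empty?
  end
end

known = %w[20240101000000 20990101000000] # sample versions for illustration
applied = {
  'main' => %w[20240101000000],                # missing the future migration
  'ci' => %w[20240101000000 20990101000000]    # fully migrated
}

puts databases_with_pending_migrations(known, applied).inspect
```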
Test 3: Subscription Disabled
Scenario: Logical replication subscription is disabled.
Steps:
ALTER SUBSCRIPTION gitlab_geo_sub DISABLE;
Expected: db_replication_lag_seconds returns nil, dashboard shows unhealthy status.
Result:
Test 5: High Replication Lag
Scenario: Secondary falls behind primary due to paused replication.
Steps:
- Disabled subscription on secondary
- Made writes on primary
- Re-enabled subscription
- Checked replication lag
Expected: db_replication_lag_seconds returns high value.
Result: pg_stat_subscription.latest_end_time.
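Tests 3 and 5 both hinge on how lag is derived. A minimal sketch, assuming lag is computed as the current time minus pg_stat_subscription.latest_end_time and that the column is NULL when the subscription is disabled or has no running worker; the function name replication_lag_seconds is hypothetical and the real db_replication_lag_seconds may differ.

```ruby
# Hypothetical: latest_end_time is a Time read from pg_stat_subscription,
# or nil when the subscription is disabled or no worker is running.
def replication_lag_seconds(latest_end_time, now: Time.now)
  return nil if latest_end_time.nil?

  # Clamp at zero so small clock skew never reports negative lag.
  [(now - latest_end_time).round, 0].max
end

now = Time.now
puts replication_lag_seconds(now - 120, now: now).inspect # subscription caught up 2 minutes ago
puts replication_lag_seconds(nil, now: now).inspect       # disabled subscription
```

Returning nil rather than 0 for a disabled subscription keeps "no data" distinguishable from "fully caught up".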
Test 6: Stale Status Record
Scenario: Geo::MetricsUpdateWorker hasn't run recently.
Steps:
- Checked GeoNodeStatus#outdated? logic
- Verified status records older than 10 minutes are flagged
Expected: GeoNodeStatus#outdated? returns true for stale records.
Result: updated_at is older than threshold.
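The staleness rule amounts to a timestamp comparison. A minimal sketch, assuming the 10-minute threshold from the steps above; the function name status_outdated? and the threshold parameter are hypothetical and kept configurable, since the real value lives in GeoNodeStatus.

```ruby
# Hypothetical: a status record is stale when its updated_at is older
# than the threshold (default 10 minutes, taken from the test steps).
def status_outdated?(updated_at, now: Time.now, threshold_seconds: 10 * 60)
  updated_at < now - threshold_seconds
end

now = Time.now
puts status_outdated?(now - 15 * 60, now: now) # record last updated 15 minutes ago
puts status_outdated?(now - 60, now: now)      # record last updated 1 minute ago
```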
Test 7: Geo Database Migration Mismatch
Scenario: Geo tracking database schema version doesn't match latest migration files.
Steps:
- Created fake future geo migration ee/db/geo/migrate/20990101000000_test_geo_migration_mismatch.rb
- Ran health check on secondary
Expected: Health check reports geo database version mismatch.
Result: "Geo database version (20260316212616) does not match latest migration (20990101000000).\nYou may have to run `gitlab-rake db:migrate:geo` as root on the secondary."
Test 8: Tracking Database Not Configured
Scenario: The geo: section is missing from config/database.yml.
Steps:
- Renamed geo: to geo_disabled: in config/database.yml
- Attempted to run Rails
Expected: Health check returns "Geo database configuration file is missing."
Result:
Test 9: Schema Structure Mismatch
Scenario: Primary and secondary have different table schemas (column added on primary but not secondary).
Steps:
- Added test column on primary:
ALTER TABLE users ADD COLUMN test_schema_mismatch_col INTEGER;
- Triggered replication with an update:
UPDATE users SET test_schema_mismatch_col = 1 WHERE id = 1;
- Checked subscription stats on secondary:
SELECT subname, apply_error_count, sync_error_count FROM pg_stat_subscription_stats;
Expected: Logical replication errors on affected tables, apply_error_count increases.
Result: apply_error_count increased to 159, confirming schema mismatch causes replication apply errors. This validates that pg_stat_subscription_stats can be used to detect schema drift between primary and secondary.
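One way such a check could work is to compare two samples of pg_stat_subscription_stats and flag subscriptions whose apply_error_count grew. This is only a sketch of the idea and is not part of this MR: the function name subscriptions_with_new_apply_errors is hypothetical, and rows are modelled as plain hashes rather than query results.

```ruby
# Hypothetical: each sample is an array of row hashes with the columns
# selected from pg_stat_subscription_stats in the steps above.
def subscriptions_with_new_apply_errors(before, after)
  previous = before.to_h { |row| [row['subname'], row['apply_error_count']] }

  after
    .select { |row| row['apply_error_count'] > previous.fetch(row['subname'], 0) }
    .map { |row| row['subname'] }
end

before = [{ 'subname' => 'gitlab_geo_sub', 'apply_error_count' => 0 }]
after  = [{ 'subname' => 'gitlab_geo_sub', 'apply_error_count' => 159 }]

puts subscriptions_with_new_apply_errors(before, after).inspect
```

Comparing deltas rather than absolute counts avoids flagging a subscription forever after a past error that has since been resolved.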
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.