PostgreSQL fails to start during Geo recovery due to leftover replication slots
- GET version: 3.8.0
- Cloud Provider: GCP/AWS/Azure/Other -- All providers
- Environment configuration: Omnibus Geo deployments with PostgreSQL
Summary
PostgreSQL fails to start when an old Primary is recovered as a new Secondary, because leftover replication slots conflict with the max_replication_slots = 0 setting applied by the gitlab_geo_recovery playbook.
Problem Statement
When recovering a Geo deployment (typically after a failover scenario):
- The old Primary is demoted and will become a new Secondary
- The recovery playbook sets max_replication_slots = 0 on the new Secondary (since Secondary sites don't need replication slots)
- If replication slots from the old Primary configuration still exist in PostgreSQL, PostgreSQL refuses to start with the error:
2025-09-02_06:06:53.58305 FATAL: too many replication slots active before shutdown
- This prevents the recovery process from completing
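As a rough diagnostic sketch, the queries below could be run on the old Primary while PostgreSQL is still up (that is, before max_replication_slots is lowered to 0) to check whether any slots remain. They rely only on the standard pg_replication_slots view. How the session is opened (for example through the Omnibus gitlab-psql wrapper) is an assumption and not something this issue specifies.

```sql
-- Current limit; the recovery playbook is expected to lower this to 0.
SHOW max_replication_slots;

-- Any row returned here is a slot that would block startup
-- once max_replication_slots = 0 takes effect.
SELECT slot_name, slot_type, active, restart_lsn
FROM pg_replication_slots;
```

If the second query returns no rows, leftover slots are unlikely to be the cause of the startup failure.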
Current Behavior
- Recovery playbook reconfigures PostgreSQL with max_replication_slots = 0
- Leftover replication slots from Primary configuration cause PostgreSQL startup failure
- Manual intervention required to drop replication slots before recovery can proceed
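The manual intervention could look roughly like the statements below. This is a sketch, not the playbook's actual task. It assumes PostgreSQL is running, so max_replication_slots must still be (or be temporarily restored to) a nonzero value. The WHERE NOT active filter is an added precaution, because pg_drop_replication_slot raises an error for a slot that is currently in use.

```sql
-- Drop every replication slot that no walsender is currently using.
-- Assumption: the leftover slots are physical slots; a logical slot
-- can only be dropped from the database it was created in.
SELECT slot_name, pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE NOT active;

-- Verify: this should report 0 before max_replication_slots is set to 0.
SELECT count(*) AS remaining_slots
FROM pg_replication_slots;
```

Any slot still listed afterwards is active, meaning a replica is still connected to it, and would need to be disconnected before it can be dropped.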
Expected Behavior
- Recovery playbook should automatically drop all replication slots on the demoted Primary before reconfiguring PostgreSQL
- PostgreSQL should start successfully with max_replication_slots = 0
- Recovery process should complete without manual intervention
Reproduction Steps
- Set up a Geo deployment with Primary and Secondary sites
- Trigger a failover scenario requiring Primary demotion
- Run the Geo recovery playbook to set up the old Primary as a new Secondary
- Observe PostgreSQL failing to start on the demoted Primary (new Secondary)
Proposed Solution
Add a task in the recovery playbook to drop all replication slots before the main recovery process reconfigures PostgreSQL. This ensures no conflicts when max_replication_slots is set to 0.
See MR#1754: Add task to drop replication from secondary sites before proceeding with recovery
Related Issues/Tickets
- Internal Ticket: ZD#650634
- Is there a risk of dropping slots that might be needed during the recovery transition period?