PostgreSQL fails to start during Geo recovery due to leftover replication slots

  • GET version: 3.8.0
  • Cloud Provider: All providers (GCP/AWS/Azure/Other)
  • Environment configuration: Omnibus Geo deployments with PostgreSQL

Summary

PostgreSQL fails to start when an old Primary is recovered as a new Secondary, because leftover replication slots conflict with the max_replication_slots = 0 setting applied by the gitlab_geo_recovery playbook.

Problem Statement

When recovering a Geo deployment (typically after a failover scenario):

  1. The old Primary is demoted and will become a new Secondary
  2. The recovery playbook sets max_replication_slots = 0 on the new Secondary (since Secondary sites don't need replication slots)
  3. If replication slots from the old Primary configuration still exist in PostgreSQL, PostgreSQL refuses to start with the error:
       2025-09-02_06:06:53.58305 FATAL:  too many replication slots active before shutdown
  4. This prevents the recovery process from completing
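
Because PostgreSQL is down at this point, the leftover slots cannot be listed with SQL. The following is a minimal diagnostic sketch, not part of the playbook: PostgreSQL persists each replication slot as a subdirectory of pg_replslot in its data directory, and startup appears to fail when more slots are found there than max_replication_slots allows. The function names are hypothetical, and the path in the comment is an assumption about a default Omnibus layout that should be checked on the node.

```python
from pathlib import Path


def leftover_slots(pg_replslot_dir):
    """Return the names of replication slots persisted on disk.

    PostgreSQL stores one subdirectory per slot under pg_replslot.
    """
    return sorted(p.name for p in Path(pg_replslot_dir).iterdir() if p.is_dir())


def startup_would_fail(pg_replslot_dir, max_replication_slots):
    """True when more slots exist on disk than the configured limit allows."""
    return len(leftover_slots(pg_replslot_dir)) > max_replication_slots


# Assumed Omnibus default location (verify on the node before relying on it):
#   /var/opt/gitlab/postgresql/data/pg_replslot
# Example:
#   startup_would_fail("/var/opt/gitlab/postgresql/data/pg_replslot", 0)
```

With max_replication_slots = 0, a single leftover slot directory is enough for the check to return True, which matches the failure described above.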

Current Behavior

  • Recovery playbook reconfigures PostgreSQL with max_replication_slots = 0
  • Leftover replication slots from Primary configuration cause PostgreSQL startup failure
  • Manual intervention required to drop replication slots before recovery can proceed

Expected Behavior

  • Recovery playbook should automatically drop all replication slots on the demoted Primary before reconfiguring PostgreSQL
  • PostgreSQL should start successfully with max_replication_slots = 0
  • Recovery process should complete without manual intervention

Reproduction Steps

  1. Set up a Geo deployment with Primary and Secondary sites
  2. Trigger a failover scenario requiring Primary demotion
  3. Run the Geo recovery playbook to set up the old Primary as a new Secondary
  4. Observe PostgreSQL failing to start on the demoted Primary (new Secondary)

Proposed Solution

Add a task in the recovery playbook to drop all replication slots before the main recovery process reconfigures PostgreSQL. This ensures no conflicts when max_replication_slots is set to 0.
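
As an illustration of what such a task would need to run, here is a hedged sketch that builds the SQL for dropping every slot while PostgreSQL is still running with its old configuration. It is not the implementation from the MR. It assumes the slot rows come from a query such as SELECT slot_name, active_pid FROM pg_replication_slots, and the helper name drop_slot_statements is hypothetical. Because pg_drop_replication_slot refuses to drop a slot that is in use, the sketch terminates the attached backend first.

```python
def drop_slot_statements(slots):
    """Build SQL statements that drop every given replication slot.

    slots: iterable of (slot_name, active_pid) tuples, where active_pid is
    None for inactive slots (as reported by pg_replication_slots).
    """
    statements = []
    for slot_name, active_pid in slots:
        if active_pid is not None:
            # An active slot cannot be dropped, so stop its WAL sender first.
            statements.append(f"SELECT pg_terminate_backend({int(active_pid)});")
        # Escape single quotes defensively before embedding the name in SQL.
        escaped = slot_name.replace("'", "''")
        statements.append(f"SELECT pg_drop_replication_slot('{escaped}');")
    return statements
```

The generated statements would have to run on the demoted Primary before max_replication_slots is set to 0, since PostgreSQL will not start afterwards if any slot remains. A client that reconnects between the terminate and drop statements could make the drop fail, so a real task would likely need a retry.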

See MR#1754: Add task to drop replication from secondary sites before proceeding with recovery

  1. Is there a risk of dropping slots that might be needed during the recovery transition period?