# Geo: Improve post-failover cleanup documentation
## Problem
The current Geo disaster recovery documentation lacks clear guidance for post-failover scenarios. In particular, [Step 6: Removing the former secondary's tracking database](https://docs.gitlab.com/administration/geo/disaster_recovery/#step-6-removing-the-former-secondarys-tracking-database) is vague about the state of the promoted site after cleanup.
## Current State
The existing documentation:
- Focuses primarily on the failover process itself
- Mentions "bringing the old site back as a secondary" but doesn't provide detailed instructions
- Doesn't clearly explain how to clean up Geo artifacts when a user wants to operate without Geo after failover
- Has led users to follow the "Disable Geo" documentation instead, which isn't designed for post-failover scenarios
## Proposal
Clarify that after completing the cleanup steps of a site promotion, you are running a non-Geo site.
## Implementation plan
- Add to the end of https://docs.gitlab.com/administration/geo/disaster_recovery/#step-6-removing-the-former-secondarys-tracking-database:
  > At this point, your promoted site is a normal GitLab site without Geo configured. Optionally, you can [bring the old site back as a secondary](bring_primary_back/#configure-the-former-primary-site-to-be-a-secondary-site).
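For context, the cleanup that Step 6 refers to can be sketched roughly as follows for a Linux package installation. This is an illustrative outline only, not the authoritative procedure; the exact services and `gitlab.rb` settings on a given site may differ, so the linked documentation should remain the source of truth.

```shell
# Sketch: removing the former secondary's tracking database on the
# promoted site (assumption: Linux package installation).

# Stop the Geo tracking database service:
sudo gitlab-ctl stop geo-postgresql

# In /etc/gitlab/gitlab.rb, disable the tracking database and any
# remaining secondary role configuration, for example:
#   geo_postgresql['enable'] = false

# Apply the configuration change:
sudo gitlab-ctl reconfigure
```

After this, the promoted site runs as a normal GitLab site without Geo configured.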
## Related Issues
- Related to Remove tracking database after failover
## Additional Context
This issue was identified in an internal Slack conversation where a customer with a 25k Geo environment failed over, removed the old primary, and needed guidance on properly cleaning up Geo artifacts on the promoted site, like the Geo tracking database and PG replication slots.
Problems encountered:
- Can we delete the tracking database?
  - Yes, and it is recommended.
- We got an error saying the replication slot is active when trying to delete it as described in Disabling Geo.
  - The Disabling Geo documentation shouldn't be used post-failover. They should have followed only https://docs.gitlab.com/administration/geo/disaster_recovery/#step-6-removing-the-former-secondarys-tracking-database. With a GitLab Linux package managed PostgreSQL server, there would be no replication slots after promotion. In this case, the customer was on RDS with replicas configured, so the replication slots were expected for them.
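The replication slot confusion above can be diagnosed with standard PostgreSQL catalog queries. A minimal sketch, assuming a Linux package installation that provides `gitlab-psql` (on RDS you would connect with `psql` instead); `SLOT_NAME` is a placeholder:

```shell
# List replication slots on the promoted site's database. An active slot
# (active = t) is still in use, for example by an RDS read replica, and
# should not be dropped:
sudo gitlab-psql -c "SELECT slot_name, slot_type, active FROM pg_replication_slots;"

# Drop a slot only if it is inactive and confirmed to be a leftover from
# the old Geo replication setup:
sudo gitlab-psql -c "SELECT pg_drop_replication_slot('SLOT_NAME');"
```

This distinction explains the error the customer hit: PostgreSQL refuses to drop a slot that is still active, which is expected behavior when replicas are attached.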
Edited by Michael Kozono