Geo: Replication should be easy to pause and resume (#2159)

### Summary

There are several scenarios in which systems administrators may want to pause, resume, or kill replication to a secondary node. Currently, these operations are not well supported and require a number of manual steps. This limits Geo, for example in planned failover scenarios or when high load on the primary is detected. We should implement tools that allow systems administrators to perform these steps easily and with confidence.

### Problem to solve

We can distinguish between `pause`, `resume` and `kill`.

- `Pause` and `resume` are graceful operations, similar to a Linux `SIGINT`: you allow replication to shut down gracefully and wait, e.g., for all ongoing replication events to finish first. Resume should pick up from this point without issue.
- ~~`Kill`: this should shut down any replication as soon as possible without waiting for events to finish. This should be an emergency measure/last resort because it may cause other downstream issues. This could be similar to `SIGKILL`.~~

**Update:** This is likely to be undesirable: https://gitlab.com/gitlab-org/gitlab/issues/35914

There are at least two scenarios in which it can be desirable to pause and resume replication:

1. During a planned failover: after all items are replicated, it can be desirable to pause any further replication.
1. During upgrades: if the secondary and primary are fully in sync before an upgrade, pausing replication while the primary is upgraded eliminates the possibility of replicating damaging changes to the secondary. An example scenario could be:
    * Upgrade primary
    * Primary breaks
    * Rollback primary
    * Uh oh, data was lost somewhere
    * Failover to DR secondary, but wait, data is already lost there as well
    * Restore backup of primary (ouch)

~~`Killing` Geo may be desirable in the following scenarios:~~

1. We put too much load on the primary and it is affecting the stability of the system. This may be a huge issue for large instances, such as GitLab.com.
1. Some adverse event happened on the primary that should not be synced to the secondary by any means.

**Update**: Based on https://gitlab.com/gitlab-org/gitlab/issues/35914, a non-graceful kill switch is not desirable.

### Proposal

1. Enhance the existing `pause` logic to include database replication. Alternatively, distinguish between each data type, but that will be a later iteration.
1. On the **secondary**, implement two different rake tasks to manage pause and resume (~~and kill~~), e.g.
    * `gitlab-rake geo:pause`
    * `gitlab-rake geo:resume`
    * ~~`gitlab-rake geo:kill`~~
1. Add documentation that makes it clear which option is appropriate in which situation.
1. Add pausing and resuming to the UI. This will be a later iteration.
1. Redesign the pause replication button in the UI (it does not pause database replication).
1. We may want to have end-to-end tests for this.

### Intended users

* [Systems administrators](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#sidney-systems-administrator)

### Further details

### What does success look like, and how can we measure that?

* We can pause and resume replication gracefully in an HA and non-HA setting from the **secondary**
* Administrators can navigate to the UI and easily pause replication (second iteration)

### What is the type of buyer?

* Premium
* Ultimate

### Links / references
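
### Illustrative sketch: graceful pause and resume

To make the graceful semantics from "Problem to solve" concrete, here is a minimal toy model in Python. It is not how Geo is implemented. The `Replicator` class, its methods, and the `pause_at_b` hook are hypothetical names invented for illustration. The sketch only shows the intended behaviour: a pause lets the in-flight event finish, no new event starts while paused, and resume continues from the next pending event.

```python
from collections import deque


class Replicator:
    """Toy model (hypothetical) of a secondary applying events in order."""

    def __init__(self, events):
        self.pending = deque(events)  # events not yet applied
        self.applied = []             # events fully applied on the secondary
        self.paused = False

    def pause(self):
        # Graceful: only sets a flag; an in-flight event is never interrupted.
        self.paused = True

    def resume(self):
        self.paused = False

    def run(self, on_event=None):
        """Apply pending events until none remain or a pause is requested."""
        while self.pending and not self.paused:
            event = self.pending.popleft()
            if on_event:
                on_event(self, event)   # hook that may call pause() mid-event
            self.applied.append(event)  # the in-flight event still completes
        return self.applied


def pause_at_b(replicator, event):
    # Simulates an administrator pausing while event "b" is in flight.
    if event == "b":
        replicator.pause()


replicator = Replicator(["a", "b", "c", "d"])
replicator.run(on_event=pause_at_b)
applied_while_paused = list(replicator.applied)  # "b" finished, "c" not started
replicator.resume()
replicator.run()
applied_after_resume = list(replicator.applied)
```

In this model the pause request that arrives during event `b` does not drop or truncate `b`, and no event is applied twice or skipped after resume. A non-graceful kill would instead abandon the in-flight event, which is the behaviour the update above argues against.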
