Move database cluster to Google Cloud
Instead of waiting for all of GitLab to be ready to move to Google Cloud, I would like to start investigating our options and techniques for moving all databases to Google Cloud ahead of time. This should reduce pressure on production, as we don't have to move everything at once.
The rough procedure I have in mind would be the following:
1. Set up a new cluster in Google Cloud with one primary and three secondaries. These secondaries replicate from the Google Cloud primary.
1. The Google Cloud primary replicates from the Azure primary (so it is actually a secondary for now).
1. Once Google Cloud is in sync, we update the Azure secondaries one by one so they replicate from the Google Cloud primary (which is still a secondary at this point). We wait until everything is in sync and monitor replication lag.
1. We add the Google Cloud secondaries to the load balancer and monitor for any increase in latency.
1. If all is well, we fail over from the Azure primary to the Google Cloud primary.
1. We remove the Azure hosts from the DB load balancer and terminate them.
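
Several of these steps depend on "in sync" being something we can measure. As a rough sketch, assuming the cluster is PostgreSQL (repmgr suggests so) and that we can read the current LSN of the upstream and the replay LSN of each replica, the gate before the failover could look like the following. The function names and the 16 MiB threshold are made up for illustration and are not part of any existing tooling.

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.

    An LSN is a 64-bit number printed as two hexadecimal halves
    separated by a slash (high 32 bits / low 32 bits).
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(upstream_lsn, replica_lsn):
    """Bytes of WAL the replica still has to replay (never negative)."""
    return max(0, lsn_to_int(upstream_lsn) - lsn_to_int(replica_lsn))


def lagging_hosts(upstream_lsn, replica_lsns, max_lag_bytes=16 * 1024 * 1024):
    """Return {host: lag} for every replica above the (assumed) threshold.

    replica_lsns maps a host name to its replay LSN. An empty result
    means every replica is close enough to continue with the next step.
    """
    result = {}
    for host, lsn in replica_lsns.items():
        lag = replication_lag_bytes(upstream_lsn, lsn)
        if lag > max_lag_bytes:
            result[host] = lag
    return result
```

For example, `lagging_hosts("1/0", {"gc-sec1": "0/FFFFFFFF", "gc-sec2": "0/0"})` would only report `gc-sec2`, since `gc-sec1` is a single byte behind. How the LSNs are collected (SQL, repmgr, monitoring) is left open here.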
Graphically this leads to the following setup:
```mermaid
graph LR
gc-primary[Primary]
az-primary[Primary]
gitlab[GitLab]
style az-primary fill:#96DCFF
style gc-primary fill:#96DCFF
subgraph Google Cloud
gc-primary --> gc-sec1[Secondary 1]
gc-primary --> gc-sec2[Secondary 2]
gc-primary --> gc-sec3[Secondary 3]
end
subgraph Azure
az-primary --> az-sec1[Secondary 1]
az-primary --> az-sec2[Secondary 2]
az-primary --> az-sec3[Secondary 3]
end
az-primary --> gc-primary
gitlab --> az-primary
```
After the failover the topology will be:
```mermaid
graph LR
gc-primary[Primary]
az-primary["Primary (disabled)"]
gitlab[GitLab]
style az-primary fill:#96DCFF
style gc-primary fill:#96DCFF
subgraph Google Cloud
gc-primary --> gc-sec1[Secondary 1]
gc-primary --> gc-sec2[Secondary 2]
gc-primary --> gc-sec3[Secondary 3]
end
subgraph Azure
az-primary --> az-sec1[Secondary 1]
az-primary --> az-sec2[Secondary 2]
az-primary --> az-sec3[Secondary 3]
end
gitlab --> gc-primary
```
This approach is fairly straightforward. Until we fail over, a rollback is trivial; after that, we simply fail back over to Azure.
This particular setup is deliberate: by having the Google Cloud secondaries replicate from the Google Cloud primary, there are fewer hosts that we need to change during the failover. If we can also somehow ensure repmgr only considers the two primaries (and never promotes one of the secondaries), we can also prevent a split brain from occurring.
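
To make the "fewer hosts to change" argument concrete, here is a small sketch that models a topology as a mapping from host to replication upstream and counts which hosts need a different upstream after the failover. The "flat" variant, where the Google Cloud secondaries replicate straight from the Azure primary, is a hypothetical alternative used only for comparison. The Azure secondaries are left out to keep the example short.

```python
def upstream_changes(before, after):
    """Hosts whose replication upstream differs between two topologies."""
    return {host for host in before if before[host] != after.get(host)}


# host -> upstream; None means the host is a primary.
# Proposed setup: Google Cloud secondaries cascade from the Google Cloud primary.
cascaded_before = {
    "az-primary": None,
    "gc-primary": "az-primary",
    "gc-sec1": "gc-primary",
    "gc-sec2": "gc-primary",
    "gc-sec3": "gc-primary",
}
# Failing over only promotes gc-primary; its secondaries keep their upstream.
cascaded_after = dict(cascaded_before, **{"gc-primary": None})

# Hypothetical alternative: every Google Cloud host replicates from Azure.
flat_before = {
    "az-primary": None,
    "gc-primary": "az-primary",
    "gc-sec1": "az-primary",
    "gc-sec2": "az-primary",
    "gc-sec3": "az-primary",
}
# Here the secondaries also have to be re-pointed during the failover.
flat_after = {
    "az-primary": None,
    "gc-primary": None,
    "gc-sec1": "gc-primary",
    "gc-sec2": "gc-primary",
    "gc-sec3": "gc-primary",
}

cascaded = upstream_changes(cascaded_before, cascaded_after)
flat = upstream_changes(flat_before, flat_after)
```

In this model `cascaded` contains only `gc-primary`, while `flat` contains `gc-primary` plus all three Google Cloud secondaries, so the proposed setup touches one host during the failover instead of four.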