Implement Phase 7 rollback strategy
Summary
We are going to take downtime when executing Phase 7. This means that no writes should happen, and the Main cluster and the CI cluster will be perfectly in sync once the replication lag has gone to zero.
To mitigate the risk that the promotion of the read-only CI cluster to read-write fails, we need to define a rollback strategy. See gitlab-com/gl-infra&693 (closed) for more details.
The purpose of the rollback strategy is to be able to quickly and reliably reset us to Phase 4, which is currently rolled out in STG and PRD.
Problem
Promoting the CI cluster may go wrong. If it does, recreating the cluster quickly is important to avoid prolonged downtime. Given the size of the DB (terabytes), we can't easily recreate the cluster from scratch.
Proposal
Restore CI PGBouncer writes to point back to the Main primary and restore read queries to point back to the Main replicas. This effectively rolls back "Phase 3" without rolling back "Phase 4". Additionally, we'll need to bump all CI sequences to avoid ID conflicts with records that were written to the CI database and lost in the rollback.
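As a rough illustration of the sequence bump, the sketch below generates one `setval()` statement per CI sequence, advancing each one well past its current value so new rows written on Main cannot collide with lost CI rows. The sequence names, the safety gap, and the helper itself are illustrative assumptions, not the actual migration we'd run.

```python
# Hypothetical sketch: emit SQL that bumps CI sequences past any IDs that may
# have been allocated in the CI database and lost during the rollback.
def bump_sequence_sql(sequences, gap=100_000):
    """Return one setval() statement per sequence name.

    `gap` is an assumed safety margin added on top of the sequence's
    current value; the real value would be sized from write volume.
    """
    statements = []
    for seq in sequences:
        statements.append(
            f"SELECT setval('{seq}', (SELECT last_value FROM {seq}) + {gap});"
        )
    return statements

# Example with made-up sequence names:
for sql in bump_sequence_sql(["ci_builds_id_seq", "ci_pipelines_id_seq"]):
    print(sql)
```

In practice this would run against the Main database once writes are stopped, before re-enabling CI traffic.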
We'll accomplish this by updating the Main Patroni nodes to advertise as ci-db-replica.service.consul in Consul (in addition to db-replica.service.consul), and we'll rename the Consul service on all CI Patroni nodes while they are rebuilt from scratch. Once they are rebuilt, it's just a matter of renaming the Consul service on these CI Patroni nodes back and removing the extra Consul service from the Main nodes.
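For concreteness, the extra advertisement on the Main replicas could look something like the Consul service definition below. The port, check interval, and the use of Patroni's REST API `/replica` health endpoint on 8008 are assumptions for illustration; the real registration may instead be driven through Patroni's own Consul integration.

```json
{
  "service": {
    "name": "ci-db-replica",
    "port": 5432,
    "check": {
      "http": "http://localhost:8008/replica",
      "interval": "10s"
    }
  }
}
```

With this in place, ci-db-replica.service.consul resolves to the Main replicas until the rebuilt CI nodes reclaim the name.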
To avoid hitting client connection limits on the Main Patroni PGBouncer processes, we also need to deploy 3 new PGBouncer processes dedicated to the CI read-only traffic, per #361759 (comment 963711253).
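A minimal sketch of what one of those three dedicated PgBouncer processes might look like, assuming transaction pooling and the Consul service name from above; the database name, port, and pool limits are illustrative, not the values we'd actually deploy:

```ini
; Hypothetical pgbouncer.ini for a CI read-only pooler.
[databases]
; Route CI reads through the Consul-discovered replica service name.
gitlabhq_production_ci = host=ci-db-replica.service.consul port=5432

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6434
pool_mode = transaction
; Keep client connections off the Main poolers by terminating them here.
max_client_conn = 2048
default_pool_size = 50
```

Running this as a separate process (rather than adding the database to the existing Main poolers) is what keeps CI client connections from counting against the Main PGBouncer limits.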
It works because:
- The Main cluster already has enough replicas to serve the volume of reads
- The CI replicas (before the rollout) were just a slightly delayed version of the Main replicas, so they are equivalent from the application's perspective
It's good because:
- It doesn't require us to deploy a 2nd Patroni cluster
- It is very fast because we don't need to reconfigure any Rails processes
- Once the CI Patroni nodes are rebuilt and replicating well, we can re-add them to Consul and then remove the Main nodes from Consul without any urgency, because it's OK to serve reads from Main and CI at the same time
- It doesn't create a new CI Patroni cluster with a confusing name like ci-standby, which would likely require us to eventually fail back to the real CI cluster later, or to live with that name permanently