Implement Phase 7 rollback strategy
Summary
We are going to take downtime when executing Phase 7. This means that no writes should happen, and the Main cluster and the CI cluster will be perfectly in sync once the replication lag has gone to zero.
To mitigate the risk that the promotion of the read-only CI cluster to read-write fails, we need to define a rollback strategy. See gitlab-com/gl-infra&693 (closed) for more details.
The purpose of the rollback strategy is to be able to quickly and reliably reset us to Phase 4, which is currently rolled out in STG and PRD.
Problem
Promoting the CI cluster may go wrong. If it does, recreating the cluster quickly is important to avoid prolonged downtime. Given the size of the DB (terabytes), we can't easily recreate the cluster from scratch.
Proposal
Restore CI PGBouncer writes to point back to the Main primary and restore read queries to point back to the Main replicas. This effectively rolls back "Phase 3" without rolling back "Phase 4". Additionally, we'll need to bump all CI sequences to avoid ID conflicts with records that were written to the CI database and lost in the rollback.
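As a rough illustration of the sequence bump, the sketch below generates one `setval()` statement per CI sequence, advancing each one well past its current value so new rows written on Main cannot collide with lost CI rows. The sequence names, the safety gap, and the helper itself are illustrative assumptions, not the actual migration we'd run.

```python
# Hypothetical sketch: emit SQL that bumps CI sequences past any IDs that may
# have been allocated in the CI database and lost during the rollback.
def bump_sequence_sql(sequences, gap=100_000):
    """Return one setval() statement per sequence name.

    `gap` is an assumed safety margin added on top of the sequence's
    current value; the real value would be sized from write volume.
    """
    statements = []
    for seq in sequences:
        statements.append(
            f"SELECT setval('{seq}', (SELECT last_value FROM {seq}) + {gap});"
        )
    return statements

# Example with made-up sequence names:
for sql in bump_sequence_sql(["ci_builds_id_seq", "ci_pipelines_id_seq"]):
    print(sql)
```

In practice this would run against the Main database once writes are stopped, before re-enabling CI traffic.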
We'll accomplish this by updating the Main Patroni nodes to advertise as ci-db-replica.service.consul in Consul (in addition to db-replica.service.consul), and we'll rename the Consul service on all CI Patroni nodes while they are rebuilt from scratch. Once they are rebuilt, it's just a matter of renaming the Consul service on these CI Patroni nodes back and removing the extra Consul service from the Main nodes.
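For concreteness, the extra advertisement on the Main replicas could look something like the Consul service definition below. The port, check interval, and the use of Patroni's REST API `/replica` health endpoint on 8008 are assumptions for illustration; the real registration may instead be driven through Patroni's own Consul integration.

```json
{
  "service": {
    "name": "ci-db-replica",
    "port": 5432,
    "check": {
      "http": "http://localhost:8008/replica",
      "interval": "10s"
    }
  }
}
```

With this in place, ci-db-replica.service.consul resolves to the Main replicas until the rebuilt CI nodes reclaim the name.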
To avoid hitting client connection limits on the Main Patroni PGBouncer processes, we also need to deploy 3 new PGBouncer processes dedicated to the CI read-only traffic, per #361759 (comment 963711253).
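A minimal sketch of what one of those three dedicated PgBouncer processes might look like, assuming transaction pooling and the Consul service name from above; the database name, port, and pool limits are illustrative, not the values we'd actually deploy:

```ini
; Hypothetical pgbouncer.ini for a CI read-only pooler.
[databases]
; Route CI reads through the Consul-discovered replica service name.
gitlabhq_production_ci = host=ci-db-replica.service.consul port=5432

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6434
pool_mode = transaction
; Keep client connections off the Main poolers by terminating them here.
max_client_conn = 2048
default_pool_size = 50
```

Running this as a separate process (rather than adding the database to the existing Main poolers) is what keeps CI client connections from counting against the Main PGBouncer limits.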
It works because:
- The Main cluster already has enough replicas to serve the volume of reads
- The CI replicas (before the rollout) were just a slightly delayed version of the Main replicas, so they are equivalent from the application's perspective
It's good because:
- It doesn't require us to deploy a 2nd Patroni cluster
- It is very fast because we don't need to reconfigure any Rails processes
- Once the CI Patroni nodes are rebuilt and replicating well, we can re-add them to Consul and then remove the Main nodes from Consul without any urgency, because it's OK to serve reads from Main and CI at the same time
- It doesn't create a new CI Patroni cluster with a confusing name like ci-standby, which would likely require us to eventually fail back to the real CI cluster later, or to live with that name permanently