
Switch from the old cluster to the new cluster (2023/10/23)

Follow-up of #655.

Great news: all the gitlab data has been hot-migrated to the new cluster:

  • S3 for artifacts
  • the db is on standby and ready (continuously synced)
  • the git data is also on standby and ready, thanks to Ceph RBD mirroring

This means the cluster migration can be done as a simple "switch".

Tentative date is Monday October 23, at 10:00 UTC

Downtime should be no more than 1 hour

Steps required:

Preps:

  • On Sunday evening, lower the TTL of gitlab.fd.o so the final DNS switch propagates quickly
  • note the replica counts of the various pods (a way to record them is sketched after this list):
    • exporter: 1
    • shell: 2
    • registry: 4
    • sidekiq: 2
    • webservice: 10
  • ensure all backups against the current gitlab deployment have completed (check the toolbox logs with this LogQL query):
    {namespace="gitlab", app="toolbox"} |= `Packing up backup tar`

Put down the current cluster:

  • patch the k8s workloads so they no longer access the db:
kubectl -n gitlab patch deploy gitlab-prod-webservice-default -p "{\"spec\": {\"replicas\": 0}}"
kubectl -n gitlab patch deploy gitlab-prod-registry -p "{\"spec\": {\"replicas\": 0}}"
kubectl -n gitlab patch deploy gitlab-prod-gitlab-shell -p "{\"spec\": {\"replicas\": 0}}"
kubectl -n gitlab patch deploy gitlab-prod-sidekiq-native-chart-v2 -p "{\"spec\": {\"replicas\": 0}}"
kubectl -n gitlab patch deploy gitlab-prod-gitlab-exporter -p "{\"spec\": {\"replicas\": 0}}"
  • patch the cluster deployment to:
    • have all gitaly pods point at the new cluster (this will stop the current gitaly pods)
    • point the db in globals at the new cluster (an illustrative sketch follows this list)
    • point the artifacts/packages/pages S3 endpoint at the new cluster
    • deploy
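
Purely as an illustration of the db part of that change (a sketch assuming the stock GitLab Helm chart values layout and the gitlab-prod release name; in practice the change is made in the deployment config and rolled out from there, and the object storage endpoint is configured through the connection settings rather than a single flag):
# Hypothetical direct helm call: repoint global.psql.host at the new cluster's
# PostgreSQL service, keeping every other value as currently deployed
helm upgrade gitlab-prod gitlab/gitlab -n gitlab --reuse-values \
  --set global.psql.host=<new-cluster-db-host>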

Deploying this should effectively put down all the gitaly nodes on the old cluster.
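
A quick way to confirm that from the old cluster (a sketch; assumes the gitaly pods carry "gitaly" in their name):
# No gitaly pod should remain Running on the old cluster
kubectl -n gitlab get pods | grep gitaly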

DB switch:

  • on the old db pod:
PGPASSWORD=${POSTGRES_POSTGRES_PASSWORD} psql -U postgres -d template1
template1> select * from pg_stat_replication;
  • on the new db pod:
PGPASSWORD=${POSTGRES_POSTGRES_PASSWORD} psql -U postgres -d template1
template1> select * from pg_stat_wal_receiver;
  • ensure the WAL is correctly synced: received_lsn == latest_end_lsn
  • terminate the postgresql pod on the old cluster
  • on the new db:
PGPASSWORD=${POSTGRES_POSTGRES_PASSWORD} psql -U postgres -d template1
template1> select pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();

# The values above should be steady -> we can continue -> Ctrl-C

pg_ctl promote
  • change the deployment of the new db to not be a hot standby (a post-promotion sanity check is sketched after this list)
  • change the deployment of the old db to be a standby of the new one
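
A quick sanity check that the promotion took effect, run on the new db pod (pg_is_in_recovery() returns f on a primary, t on a standby):
# Should return f once the new db has been promoted
PGPASSWORD=${POSTGRES_POSTGRES_PASSWORD} psql -U postgres -d template1 -c "select pg_is_in_recovery();"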

Gitaly switch

  • In the old cluster rook-ceph toolbox pod:
rbd mirror image demote replicapool-ssd/csi-vol-5519d973-c048-11eb-a24d-e288ac147db5
rbd mirror image demote replicapool-ssd/csi-vol-2c18372e-c03b-11eb-a24d-e288ac147db5
rbd mirror image demote replicapool-ssd/csi-vol-32aa9540-c040-11eb-a24d-e288ac147db5
rbd mirror image demote replicapool-ssd/csi-vol-f3f04516-6890-11ec-ad8d-9adfeb4dc2a2
  • In the new cluster rook-ceph toolbox pod:
rbd mirror image promote replicapool-ssd/csi-vol-5519d973-c048-11eb-a24d-e288ac147db5
rbd mirror image promote replicapool-ssd/csi-vol-2c18372e-c03b-11eb-a24d-e288ac147db5
rbd mirror image promote replicapool-ssd/csi-vol-32aa9540-c040-11eb-a24d-e288ac147db5
rbd mirror image promote replicapool-ssd/csi-vol-f3f04516-6890-11ec-ad8d-9adfeb4dc2a2
  • Change the new cluster deployment to have the gitaly pods hosted there
  • If pods are not able to start, in the new cluster rook-ceph toolbox pod:
rbd mirror image disable replicapool-ssd/csi-vol-5519d973-c048-11eb-a24d-e288ac147db5
rbd mirror image disable replicapool-ssd/csi-vol-2c18372e-c03b-11eb-a24d-e288ac147db5
rbd mirror image disable replicapool-ssd/csi-vol-32aa9540-c040-11eb-a24d-e288ac147db5
rbd mirror image disable replicapool-ssd/csi-vol-f3f04516-6890-11ec-ad8d-9adfeb4dc2a2

The last commands should not be required, but tests showed that they were 😢. The problem is that this not only disables mirroring, it also removes the RBD volume on the remote cluster.
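
To check what state an image is actually in (before promoting, or when the pods fail to start), the mirroring status can be queried from either rook-ceph toolbox pod; for example, for the first image above:
# Shows the mirroring state of the image as seen from this cluster
rbd mirror image status replicapool-ssd/csi-vol-5519d973-c048-11eb-a24d-e288ac147db5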

Put everything back up

  • patch k8s workloads to restart on the old cluster (the new one should already be running):
kubectl -n gitlab patch deploy gitlab-prod-webservice-default -p "{\"spec\": {\"replicas\": 10}}"
kubectl -n gitlab patch deploy gitlab-prod-registry -p "{\"spec\": {\"replicas\": 4}}"
kubectl -n gitlab patch deploy gitlab-prod-gitlab-shell -p "{\"spec\": {\"replicas\": 2}}"
kubectl -n gitlab patch deploy gitlab-prod-gitlab-exporter -p "{\"spec\": {\"replicas\": 1}}"
  • sidekiq is not restarted there!
  • patch k8s workloads on the new cluster (a rollout check is sketched after this list):
kubectl -n gitlab patch deploy gitlab-prod-webservice-default -p "{\"spec\": {\"replicas\": 10}}"
kubectl -n gitlab patch deploy gitlab-prod-gitlab-shell -p "{\"spec\": {\"replicas\": 2}}"
kubectl -n gitlab patch deploy gitlab-prod-gitlab-exporter -p "{\"spec\": {\"replicas\": 1}}"
kubectl -n gitlab patch deploy gitlab-prod-sidekiq-native-chart-v2 -p "{\"spec\": {\"replicas\": 2}}"
  • sidekiq is on the new cluster!
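
Once the patches are applied, the rollouts can be watched until they settle (a sketch; same deployment names as above, repeated for each deployment and run against each cluster):
# Blocks until the webservice deployment has all replicas available again
kubectl -n gitlab rollout status deploy/gitlab-prod-webservice-default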

Edit DNS zone

  • rndc freeze
  • change the IPs of gitlab.fd.o (a verification sketch follows this list)
  • bump serial
  • rndc thaw
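
To confirm the change is being served (a sketch; gitlab.fd.o stands for gitlab.freedesktop.org), query the record once the zone is thawed:
# Should return the new cluster's address; add @<nameserver> to bypass a caching resolver
dig +short gitlab.freedesktop.org A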

Final DNS steps:

  • Ensure everything works fine
  • change the gitlab.fd.o TTL back to its normal value

Done!
