Switch from the old cluster to the new cluster (2023/10/23)

Follow-up of #655.

Great news: all the gitlab data has been hot-migrated to the new cluster:
- S3 for artifacts
- the db is on standby, ready (continuously synced)
- the git data is also on standby, ready, thanks to Ceph RBD mirroring

This means we can do the cluster migration with just a "switch".

Tentative date: Monday, October 23, at 10:00 UTC.
Downtime should be no more than 1 hour.
Steps required:

Preps:
- On Sunday evening, lower the TTL of gitlab.fd.o
- Note the various pod replica counts:
  - exporter: 1
  - shell: 2
  - registry: 4
  - sidekiq: 2
  - webservice: 10
- Ensure all the backups are done while still talking to the current gitlab deployment:

      {namespace="gitlab", app="toolbox"} |= `Packing up backup tar`
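The replica counts above are needed again when scaling everything back up; a minimal sketch that keeps them in one place (the `replicas` helper is hypothetical, values taken from the list above):

```shell
# Hypothetical helper: map each deployment to the replica count noted above,
# so the scale-down and scale-up steps use the same numbers.
replicas() {
  case "$1" in
    exporter)   echo 1 ;;
    shell)      echo 2 ;;
    registry)   echo 4 ;;
    sidekiq)    echo 2 ;;
    webservice) echo 10 ;;
    *) echo "unknown deployment: $1" >&2; return 1 ;;
  esac
}

replicas webservice   # prints 10
```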
Put down the current cluster:
- patch the k8s workloads so they no longer access the db:

      kubectl -n gitlab patch deploy gitlab-prod-webservice-default -p "{\"spec\": {\"replicas\": 0}}"
      kubectl -n gitlab patch deploy gitlab-prod-registry -p "{\"spec\": {\"replicas\": 0}}"
      kubectl -n gitlab patch deploy gitlab-prod-gitlab-shell -p "{\"spec\": {\"replicas\": 0}}"
      kubectl -n gitlab patch deploy gitlab-prod-sidekiq-native-chart-v2 -p "{\"spec\": {\"replicas\": 0}}"
      kubectl -n gitlab patch deploy gitlab-prod-gitlab-exporter -p "{\"spec\": {\"replicas\": 0}}"
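The patch commands above differ only in the deployment name; a sketch that generates them from one list so none is missed (`scale_patch` is a hypothetical helper that only prints the commands for review):

```shell
# Hypothetical helper: print the kubectl patch command for one deployment.
# Pipe the output to sh (or drop the printf indirection) to actually run them.
scale_patch() {
  printf 'kubectl -n gitlab patch deploy %s -p '\''{"spec": {"replicas": %s}}'\''\n' "$1" "$2"
}

for d in webservice-default registry gitlab-shell sidekiq-native-chart-v2 gitlab-exporter; do
  scale_patch "gitlab-prod-$d" 0
done
```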
- patch the cluster deployment to:
  - have all the gitaly pods point at the new cluster (this will stop the current gitaly pods)
  - point the db in globals at the new cluster
  - point the artifacts/packages/pages S3 endpoint at the new cluster
  - deploy

This should effectively put down all the gitaly nodes on the old cluster.
DB switch:
- on the old db pod:

      PGPASSWORD=${POSTGRES_POSTGRES_PASSWORD} psql -U postgres -d template1
      template1> select * from pg_stat_replication;

- on the new db pod:

      PGPASSWORD=${POSTGRES_POSTGRES_PASSWORD} psql -U postgres -d template1
      template1> select * from pg_stat_wal_receiver;

- ensure the WAL is correctly synced: received_lsn == latest_end_lsn
- terminate the postgresql pod on the old cluster
- on the new db:

      PGPASSWORD=${POSTGRES_POSTGRES_PASSWORD} psql -U postgres -d template1
      template1> select pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();
      # The values above should be steady -> we can continue -> Ctrl-C
      pg_ctl promote

- change the deployment of the new db so it is no longer a hot standby
- change the deployment of the old db to be a standby of the new one
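The "WAL correctly synced" check compares the `received_lsn` and `latest_end_lsn` values from `pg_stat_wal_receiver`; a sketch of that comparison in shell (`lsn_to_int` is a hypothetical helper converting PostgreSQL's `XXX/YYY` LSN notation into a single number, the LSN values are examples):

```shell
# Hypothetical helper: convert an LSN like "0/3000060" (two hex halves)
# into one integer so two LSNs can be compared numerically.
lsn_to_int() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 16#$hi * 4294967296 + 16#$lo ))
}

received_lsn="0/3000060"     # example values; read the real ones from psql
latest_end_lsn="0/3000060"
if [ "$(lsn_to_int "$received_lsn")" -eq "$(lsn_to_int "$latest_end_lsn")" ]; then
  echo "WAL in sync, safe to continue"
fi
```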
Gitaly switch:
- in the old cluster rook-ceph toolbox pod:

      rbd mirror image demote replicapool-ssd/csi-vol-5519d973-c048-11eb-a24d-e288ac147db5
      rbd mirror image demote replicapool-ssd/csi-vol-2c18372e-c03b-11eb-a24d-e288ac147db5
      rbd mirror image demote replicapool-ssd/csi-vol-32aa9540-c040-11eb-a24d-e288ac147db5
      rbd mirror image demote replicapool-ssd/csi-vol-f3f04516-6890-11ec-ad8d-9adfeb4dc2a2

- in the new cluster rook-ceph toolbox pod:

      rbd mirror image promote replicapool-ssd/csi-vol-5519d973-c048-11eb-a24d-e288ac147db5
      rbd mirror image promote replicapool-ssd/csi-vol-2c18372e-c03b-11eb-a24d-e288ac147db5
      rbd mirror image promote replicapool-ssd/csi-vol-32aa9540-c040-11eb-a24d-e288ac147db5
      rbd mirror image promote replicapool-ssd/csi-vol-f3f04516-6890-11ec-ad8d-9adfeb4dc2a2

- change the new cluster deployment to have the gitaly pods hosted there
- if pods are not able to start, in the new cluster rook-ceph toolbox pod:

      rbd mirror image disable replicapool-ssd/csi-vol-5519d973-c048-11eb-a24d-e288ac147db5
      rbd mirror image disable replicapool-ssd/csi-vol-2c18372e-c03b-11eb-a24d-e288ac147db5
      rbd mirror image disable replicapool-ssd/csi-vol-32aa9540-c040-11eb-a24d-e288ac147db5
      rbd mirror image disable replicapool-ssd/csi-vol-f3f04516-6890-11ec-ad8d-9adfeb4dc2a2

The last commands should not be required, but tests showed that they were.
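The demote/promote/disable steps act on the same four images; a sketch that generates the commands from one list so no image is missed (`rbd_cmd` is a hypothetical helper that only prints the commands):

```shell
# The four RBD images backing the gitaly volumes (from the steps above).
images="csi-vol-5519d973-c048-11eb-a24d-e288ac147db5
csi-vol-2c18372e-c03b-11eb-a24d-e288ac147db5
csi-vol-32aa9540-c040-11eb-a24d-e288ac147db5
csi-vol-f3f04516-6890-11ec-ad8d-9adfeb4dc2a2"

# Hypothetical helper: print one rbd command per image; remove the echo to
# execute them inside the rook-ceph toolbox pod.
rbd_cmd() {  # $1 = demote | promote | disable
  for img in $images; do
    echo "rbd mirror image $1 replicapool-ssd/$img"
  done
}

rbd_cmd demote
```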
Put everything back up:
- patch the k8s workloads to restart on the old cluster (the new one should already be running):

      kubectl -n gitlab patch deploy gitlab-prod-webservice-default -p "{\"spec\": {\"replicas\": 10}}"
      kubectl -n gitlab patch deploy gitlab-prod-registry -p "{\"spec\": {\"replicas\": 4}}"
      kubectl -n gitlab patch deploy gitlab-prod-gitlab-shell -p "{\"spec\": {\"replicas\": 2}}"
      kubectl -n gitlab patch deploy gitlab-prod-gitlab-exporter -p "{\"spec\": {\"replicas\": 1}}"

- sidekiq is not restarted there!
- patch the k8s workloads on the new cluster:

      kubectl -n gitlab patch deploy gitlab-prod-webservice-default -p "{\"spec\": {\"replicas\": 10}}"
      kubectl -n gitlab patch deploy gitlab-prod-shell -p "{\"spec\": {\"replicas\": 2}}"
      kubectl -n gitlab patch deploy gitlab-prod-gitlab-exporter -p "{\"spec\": {\"replicas\": 1}}"
      kubectl -n gitlab patch deploy gitlab-prod-sidekiq-native-chart-v2 -p "{\"spec\": {\"replicas\": 2}}"

- sidekiq is on the new cluster!
Edit DNS zone:
- rndc freeze
- change the IPs of gitlab.fd.o
- bump the zone serial
- rndc thaw
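Bumping the serial is easy to get wrong by hand; a sketch assuming the common `YYYYMMDDnn` date-based serial convention (`next_serial` is a hypothetical helper; skip it if the zone uses a plain counter):

```shell
# Hypothetical helper: compute the next YYYYMMDDnn zone serial.
next_serial() {
  local old=$1 today nn
  today=$(date +%Y%m%d)
  if [ "${old%??}" = "$today" ]; then
    nn=${old#????????}        # last two digits
    nn=${nn#0}                # strip leading zero to avoid octal parsing of 08/09
    printf '%s%02d\n' "$today" $(( nn + 1 ))
  else
    printf '%s00\n' "$today"
  fi
}

next_serial 2023102200
```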
Final DNS steps:
- ensure everything works fine
- restore the gitlab.fd.o TTL

Done!
Edited by Benjamin Tissoires