
Switch from the old cluster to the new cluster (2023/10/23)

Follow-up of #655.

Great news: all the gitlab data has been hot-migrated to the new cluster:

  • S3 for artifacts
  • the db is on standby and ready (continuously synced)
  • the git data is also on standby and ready, thanks to Ceph RBD mirroring

This means the cluster migration can be done as a simple "switch".

Tentative date is Monday October 23, at 10:00 UTC

Downtime should be no more than 1 hour

Steps required:

Preps:

  • On Sunday evening, lower the TTL of gitlab.fd.o so the final DNS switch propagates quickly
  • note the replica counts of the various pods (a way to record them is sketched after this list):
    • exporter: 1
    • shell: 2
    • registry: 4
    • sidekiq: 2
    • webservice: 10
  • ensure all backups against the current gitlab deployment have completed (check the toolbox logs with this LogQL query):
    {namespace="gitlab", app="toolbox"} |= `Packing up backup tar`

Put down the current cluster:

  • patch the k8s workloads so they no longer access the db:
kubectl -n gitlab patch deploy gitlab-prod-webservice-default -p "{\"spec\": {\"replicas\": 0}}"
kubectl -n gitlab patch deploy gitlab-prod-registry -p "{\"spec\": {\"replicas\": 0}}"
kubectl -n gitlab patch deploy gitlab-prod-gitlab-shell -p "{\"spec\": {\"replicas\": 0}}"
kubectl -n gitlab patch deploy gitlab-prod-sidekiq-native-chart-v2 -p "{\"spec\": {\"replicas\": 0}}"
kubectl -n gitlab patch deploy gitlab-prod-gitlab-exporter -p "{\"spec\": {\"replicas\": 0}}"
  • patch the cluster deployment to:
    • have all gitaly pods point at the new cluster (this will stop the current gitaly pods)
    • point the db in globals at the new cluster (an illustrative sketch follows this list)
    • point the artifacts/packages/pages S3 endpoint at the new cluster
    • deploy
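
Purely as an illustration of the db part of that change (a sketch assuming the stock GitLab Helm chart values layout and the gitlab-prod release name; in practice the change is made in the deployment config and rolled out from there, and the object storage endpoint is configured through the connection settings rather than a single flag):
# Hypothetical direct helm call: repoint global.psql.host at the new cluster's
# PostgreSQL service, keeping every other value as currently deployed
helm upgrade gitlab-prod gitlab/gitlab -n gitlab --reuse-values \
  --set global.psql.host=<new-cluster-db-host>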

Deploying this should effectively put down all the gitaly nodes on the old cluster.
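
A quick way to confirm that from the old cluster (a sketch; assumes the gitaly pods carry "gitaly" in their name):
# No gitaly pod should remain Running on the old cluster
kubectl -n gitlab get pods | grep gitaly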

DB switch:

  • on the old db pod:
PGPASSWORD=${POSTGRES_POSTGRES_PASSWORD} psql -U postgres -d template1
template1> select * from pg_stat_replication;
  • on the new db pod:
PGPASSWORD=${POSTGRES_POSTGRES_PASSWORD} psql -U postgres -d template1
template1> select * from pg_stat_wal_receiver;
  • ensure the WAL is correctly synced: received_lsn == latest_end_lsn
  • terminate the postgresql pod on the old cluster
  • on the new db:
PGPASSWORD=${POSTGRES_POSTGRES_PASSWORD} psql -U postgres -d template1
template1> select pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();

# The values above should be steady -> we can continue -> Ctrl-C

pg_ctl promote
  • change the deployment of the new db to not be a hot standby (a post-promotion sanity check is sketched after this list)
  • change the deployment of the old db to be a standby of the new one
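
A quick sanity check that the promotion took effect, run on the new db pod (pg_is_in_recovery() returns f on a primary, t on a standby):
# Should return f once the new db has been promoted
PGPASSWORD=${POSTGRES_POSTGRES_PASSWORD} psql -U postgres -d template1 -c "select pg_is_in_recovery();"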

Gitaly switch

  • In the old cluster rook-ceph toolbox pod:
rbd mirror image demote replicapool-ssd/csi-vol-5519d973-c048-11eb-a24d-e288ac147db5
rbd mirror image demote replicapool-ssd/csi-vol-2c18372e-c03b-11eb-a24d-e288ac147db5
rbd mirror image demote replicapool-ssd/csi-vol-32aa9540-c040-11eb-a24d-e288ac147db5
rbd mirror image demote replicapool-ssd/csi-vol-f3f04516-6890-11ec-ad8d-9adfeb4dc2a2
  • In the new cluster rook-ceph toolbox pod:
rbd mirror image promote replicapool-ssd/csi-vol-5519d973-c048-11eb-a24d-e288ac147db5
rbd mirror image promote replicapool-ssd/csi-vol-2c18372e-c03b-11eb-a24d-e288ac147db5
rbd mirror image promote replicapool-ssd/csi-vol-32aa9540-c040-11eb-a24d-e288ac147db5
rbd mirror image promote replicapool-ssd/csi-vol-f3f04516-6890-11ec-ad8d-9adfeb4dc2a2
  • Change the new cluster deployment to have the gitaly pods hosted there
  • If pods are not able to start, in the new cluster rook-ceph toolbox pod:
rbd mirror image disable replicapool-ssd/csi-vol-5519d973-c048-11eb-a24d-e288ac147db5
rbd mirror image disable replicapool-ssd/csi-vol-2c18372e-c03b-11eb-a24d-e288ac147db5
rbd mirror image disable replicapool-ssd/csi-vol-32aa9540-c040-11eb-a24d-e288ac147db5
rbd mirror image disable replicapool-ssd/csi-vol-f3f04516-6890-11ec-ad8d-9adfeb4dc2a2

The last commands should not be required, but tests showed that they were 😢. The problem is that this not only disables mirroring, it also removes the RBD volume on the remote cluster.
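
To check what state an image is actually in (before promoting, or when the pods fail to start), the mirroring status can be queried from either rook-ceph toolbox pod; for example, for the first image above:
# Shows the mirroring state of the image as seen from this cluster
rbd mirror image status replicapool-ssd/csi-vol-5519d973-c048-11eb-a24d-e288ac147db5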

Put everything back up

  • patch k8s workloads to restart on the old cluster (the new one should already be running):
kubectl -n gitlab patch deploy gitlab-prod-webservice-default -p "{\"spec\": {\"replicas\": 10}}"
kubectl -n gitlab patch deploy gitlab-prod-registry -p "{\"spec\": {\"replicas\": 4}}"
kubectl -n gitlab patch deploy gitlab-prod-gitlab-shell -p "{\"spec\": {\"replicas\": 2}}"
kubectl -n gitlab patch deploy gitlab-prod-gitlab-exporter -p "{\"spec\": {\"replicas\": 1}}"
  • sidekiq is not restarted there!
  • patch k8s workloads on the new cluster (a rollout check is sketched after this list):
kubectl -n gitlab patch deploy gitlab-prod-webservice-default -p "{\"spec\": {\"replicas\": 10}}"
kubectl -n gitlab patch deploy gitlab-prod-gitlab-shell -p "{\"spec\": {\"replicas\": 2}}"
kubectl -n gitlab patch deploy gitlab-prod-gitlab-exporter -p "{\"spec\": {\"replicas\": 1}}"
kubectl -n gitlab patch deploy gitlab-prod-sidekiq-native-chart-v2 -p "{\"spec\": {\"replicas\": 2}}"
  • sidekiq is on the new cluster!
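
Once the patches are applied, the rollouts can be watched until they settle (a sketch; same deployment names as above, repeated for each deployment and run against each cluster):
# Blocks until the webservice deployment has all replicas available again
kubectl -n gitlab rollout status deploy/gitlab-prod-webservice-default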

Edit DNS zone

  • rndc freeze
  • change the IPs of gitlab.fd.o (a verification sketch follows this list)
  • bump serial
  • rndc thaw
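
To confirm the change is being served (a sketch; gitlab.fd.o stands for gitlab.freedesktop.org), query the record once the zone is thawed:
# Should return the new cluster's address; add @<nameserver> to bypass a caching resolver
dig +short gitlab.freedesktop.org A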

Final DNS steps:

  • Ensure everything works fine
  • change the gitlab.fd.o TTL back to its normal value

Done!
