Cutover dev.gitlab.org from Azure to GCP

Production Change

Change Summary

Performs the cutover of dev.gitlab.org from Azure to GCP.

For epic &389 (closed)

Change Details

  1. Services Impacted - ServiceUnknown
  2. Change Technician - @ggillies
  3. Change Reviewer - @hphilipps
  4. Time tracking - 240 minutes
  5. Downtime Component - YES

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 5 mins

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 200

  • Merge and apply MR for locking HPAs
  • Merge and apply MR for decreasing the DNS TTL of dev.gitlab.org
  • ssh onto the new dev.gitlab.org node and make sure all services are stopped on it
ssh 34.139.135.192
sudo su -
chef-client-disable 'gl-infra/production/-/issues/6236'
gitlab-ctl stop
gitlab-ctl status
  • ssh onto the old dev.gitlab.org node and stop all services running on the machine
ssh dev.gitlab.org
sudo su -
chef-client-disable 'gl-infra/production/-/issues/6236'
gitlab-ctl stop
gitlab-ctl status
  • Do a test DB dump on the source machine to validate there is no DB corruption at the source
time sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_dump -h /var/opt/gitlab/postgresql -d gitlabhq_production --jobs 32 -Fd -f /tmp/parallel_dump/
  • Perform a final and complete sync of /var/opt/gitlab from the old dev node to the new dev node, removing the PostgreSQL data first
sudo su -
rm -rf /var/opt/gitlab/postgresql/data
exit
sudo -E time rsync --rsync-path="sudo rsync" -av --delete --exclude 'lost+found' --exclude gitlab-rails/shared/artifacts/75/59/7559ca4a957c8c82ba04781cd66a68d6022229fca0e8e88d8e487c96ee4446d0/2018* $USER@dev.gitlab.org:/var/opt/gitlab/ /var/opt/gitlab/
  • Perform a DB reindex on the new node (needed because of collation changes in Ubuntu 20.04, see comment)
    • this might take 30m or longer - if you see statement timeouts try again with increased statement timeout at the beginning of /var/tmp/indexing_cmds.sql
    # as root
    gitlab-ctl start postgresql
    gitlab-ctl status
    gitlab-psql -a -f /var/tmp/indexing_cmds.sql
  • Perform a DB integrity check - we want an exit code of 0 - any error is a sign that the rsync is not complete, or that we need to consider rolling back
time sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_dump -h /var/opt/gitlab/postgresql -d gitlabhq_production --jobs 32 -Fd -f /tmp/parallel_dump/ 
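The exit-code check above can be wrapped in a small helper so that a failure is impossible to miss. This is a sketch: `verify` is a hypothetical helper, not part of the runbook tooling, and the pg_dump command from the step above would be passed as its arguments.

```shell
# Sketch: run a verification command and report its exit code loudly.
# "verify" is a hypothetical helper name; pass it the pg_dump command above.
verify() {
  if "$@"; then
    echo "VERIFY OK: $*"
  else
    rc=$?
    echo "VERIFY FAILED (exit $rc): $* - rsync may be incomplete, consider rollback"
    return "$rc"
  fi
}

# e.g. verify sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_dump ...
```

Using `"$@"` preserves argument quoting, so paths with spaces survive intact.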
  • Start services on the new dev.gitlab.org node
gitlab-ctl start
gitlab-ctl status
  • Validate the GitLab service is accessible, using curl to follow the redirects a browser would perform
    • curl -v http://34.139.135.192 - We should see a 3xx status message for the redirect to https://dev.gitlab.org
    • curl -vk https://34.139.135.192 - We should see a 3xx status message for the redirect to https://34.139.135.192/users/sign_in
    • ssh git@34.139.135.192 - This should return a welcome message, for example: Welcome to GitLab, $USER!
  • Merge MR swapping IP address for dev.gitlab.org
  • Validate we see the new IP address, this may take up to 60 seconds to propagate AFTER the above MR is applied
    • dig dev.gitlab.org - we should have a response that contains our new IP address 34.139.135.192
  • Validate the instance is usable
  • Revert MR to lower Kubernetes HPA to standard configuration
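The DNS validation step above can be scripted as a polling loop rather than repeated manual dig runs. This is a sketch under assumptions: `wait_for_dns` is a hypothetical helper, the timeout and poll interval are illustrative, and `getent ahosts` is used because it consults the system resolver.

```shell
# Sketch: poll until HOST resolves to EXPECTED_IP, or give up after TIMEOUT seconds.
# wait_for_dns is a hypothetical helper, not part of the runbook tooling.
wait_for_dns() {
  host="$1"; expected_ip="$2"; timeout="${3:-120}"; waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # getent ahosts lists every address the system resolver returns for the host
    if getent ahosts "$host" | awk '{print $1}' | grep -qx "$expected_ip"; then
      echo "OK: $host resolves to $expected_ip"
      return 0
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "TIMEOUT: $host did not resolve to $expected_ip within ${timeout}s"
  return 1
}

# e.g. wait_for_dns dev.gitlab.org 34.139.135.192
```

Note that `getent` reflects the local resolver cache, so a stale answer may persist until the lowered TTL from the earlier step expires.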

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 30

  • Remove alert silences
  • Check for new alerts
  • Unpause auto-deploys and make sure the next pipeline succeeds - /chatops run auto_deploy unpause
  • Validate repo mirroring is working - /chatops run mirror status
  • Announce that this change is complete on https://gitlab.slack.com/archives/C0259241C/p1643278197350200

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 180

  • ssh onto the new dev.gitlab.org node and make sure all services are stopped on it
ssh 34.139.135.192
sudo su -
chef-client-disable 'gl-infra/production/-/issues/6236'
gitlab-ctl stop
gitlab-ctl status
  • ssh onto the old dev.gitlab.org node and stop all services running on the machine
ssh 20.122.91.77
sudo su -
chef-client-disable 'gl-infra/production/-/issues/6236'
gitlab-ctl stop
gitlab-ctl status
  • Perform a sync of /var/opt/gitlab from the new dev node to the old dev node, removing the PostgreSQL data first

sudo -E time rsync --rsync-path="sudo rsync" -av --delete --exclude 'lost+found' --exclude gitlab-rails/shared/artifacts/75/59/7559ca4a957c8c82ba04781cd66a68d6022229fca0e8e88d8e487c96ee4446d0/2018* /var/opt/gitlab/ $USER@20.122.91.77:/var/opt/gitlab/ 
  • Perform a DB reindex on the old node
    • this might take 30m or longer - if you see statement timeouts try again with increased statement timeout at the beginning of /var/tmp/indexing_cmds.sql
    # as root
    gitlab-ctl start postgresql
    gitlab-ctl status
    gitlab-psql -a -f /var/tmp/indexing_cmds.sql
  • Start services on the old dev.gitlab.org node
gitlab-ctl start
gitlab-ctl status
  • Revert MR swapping IP address for dev.gitlab.org
  • Validate we see the old IP address, this may take up to 60 seconds to propagate AFTER the above MR is applied
    • dig dev.gitlab.org - we should have a response that contains our old IP address 20.122.91.77
  • Validate the old instance is usable
  • Merge MR TBD to lower Kubernetes HPA to our standard configuration
  • Complete Post Change Steps

Monitoring

Key metrics to observe

n/a - we do not keep specific metrics on this instance that would be helpful; validation is built into the procedure itself

Monitor for Errors in Sentry: https://sentry.gitlab.net/gitlab/devgitlaborg/?query=server_name%3A%22dev-1-01-sv-dev-1%22

Monitor Host stats: https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-env=ops&var-node=dev-1-01-sv-dev-1.c.gitlab-dev-1.internal

Summary of infrastructure changes

  • Does this change introduce new compute instances? No
  • Does this change re-size any existing compute instances? No
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No

Change Reviewer checklist

C4 C3 C2 C1:

  • The scheduled day and time of execution of the change is appropriate.
  • The change plan is technically accurate.
  • The change plan includes estimated timing values based on previous testing.
  • The change plan includes a viable rollback plan.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  • The change plan includes success measures for all steps/milestones during the execution.
  • The change adequately minimizes risk within the environment/service.
  • The performance implications of executing the change are well-understood and documented.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change. - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  • The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.
Edited by Graeme Gillies