Cutover dev.gitlab.org from Azure to GCP

Production Change

Change Summary

Performs the cutover of dev.gitlab.org from Azure to GCP.

For epic &389 (closed)

Change Details

  1. Services Impacted - ServiceUnknown
  2. Change Technician - @ggillies
  3. Change Reviewer - @hphilipps
  4. Time tracking - 240 minutes
  5. Downtime Component - YES

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 5 mins

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 200

  • Merge and apply MR for locking HPAs
  • Merge and apply MR for decreasing the DNS TTL of dev.gitlab.org
  • ssh onto the new dev.gitlab.org node and make sure all services are stopped on it
ssh 34.139.135.192
sudo su -
chef-client-disable 'gl-infra/production/-/issues/6236'
gitlab-ctl stop
gitlab-ctl status
  • ssh onto the old dev.gitlab.org node and stop all services running on the machine
ssh dev.gitlab.org
sudo su -
chef-client-disable 'gl-infra/production/-/issues/6236'
gitlab-ctl stop
gitlab-ctl status
  • Do a test DB dump on the source machine to validate there is no DB corruption at the source
time sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_dump -h /var/opt/gitlab/postgresql -d gitlabhq_production --jobs 32 -Fd -f /tmp/parallel_dump/
  • Perform a final and complete sync of /var/opt/gitlab from the old dev node to the new dev node, removing the PostgreSQL data first
sudo su -
rm -rf /var/opt/gitlab/postgresql/data
exit
sudo -E time rsync --rsync-path="sudo rsync" -av --delete --exclude 'lost+found' --exclude gitlab-rails/shared/artifacts/75/59/7559ca4a957c8c82ba04781cd66a68d6022229fca0e8e88d8e487c96ee4446d0/2018* $USER@dev.gitlab.org:/var/opt/gitlab/ /var/opt/gitlab/
  • Perform a DB reindex on the new node (needed because of collation changes in Ubuntu 20.04, see comment)
    • this might take 30m or longer - if you see statement timeouts try again with increased statement timeout at the beginning of /var/tmp/indexing_cmds.sql
    # as root
    gitlab-ctl start postgresql
    gitlab-ctl status
    gitlab-psql -a -f /var/tmp/indexing_cmds.sql
  • Perform a DB integrity check - we want an exit code of 0 - any error is a sign that the rsync is not complete, or that we need to consider rolling back
time sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_dump -h /var/opt/gitlab/postgresql -d gitlabhq_production --jobs 32 -Fd -f /tmp/parallel_dump/ 
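The exit-code check above can be wrapped in a small helper so that a failure is impossible to miss. This is a sketch: `verify` is a hypothetical helper, not part of the runbook tooling, and the pg_dump command from the step above would be passed as its arguments.

```shell
# Sketch: run a verification command and report its exit code loudly.
# "verify" is a hypothetical helper name; pass it the pg_dump command above.
verify() {
  if "$@"; then
    echo "VERIFY OK: $*"
  else
    rc=$?
    echo "VERIFY FAILED (exit $rc): $* - rsync may be incomplete, consider rollback"
    return "$rc"
  fi
}

# e.g. verify sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_dump ...
```

Using `"$@"` preserves argument quoting, so paths with spaces survive intact.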
  • Start services on the new dev.gitlab.org node
gitlab-ctl start
gitlab-ctl status
  • Validate the GitLab service is accessible, using curl to follow the redirects a browser would perform
    • curl -v http://34.139.135.192 - We should see a 3xx status message for the redirect to https://dev.gitlab.org
    • curl -vk https://34.139.135.192 - We should see a 3xx status message for the redirect to https://34.139.135.192/users/sign_in
    • ssh git@34.139.135.192 - This should return a welcome message, for example: Welcome to GitLab, $USER!
  • Merge MR swapping IP address for dev.gitlab.org
  • Validate we see the new IP address, this may take up to 60 seconds to propagate AFTER the above MR is applied
    • dig dev.gitlab.org - we should have a response that contains our new IP address 34.139.135.192
  • Validate the instance is usable
  • Revert MR to lower Kubernetes HPA to standard configuration
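The DNS validation step above can be scripted as a polling loop rather than repeated manual dig runs. This is a sketch under assumptions: `wait_for_dns` is a hypothetical helper, the timeout and poll interval are illustrative, and `getent ahosts` is used because it consults the system resolver.

```shell
# Sketch: poll until HOST resolves to EXPECTED_IP, or give up after TIMEOUT seconds.
# wait_for_dns is a hypothetical helper, not part of the runbook tooling.
wait_for_dns() {
  host="$1"; expected_ip="$2"; timeout="${3:-120}"; waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # getent ahosts lists every address the system resolver returns for the host
    if getent ahosts "$host" | awk '{print $1}' | grep -qx "$expected_ip"; then
      echo "OK: $host resolves to $expected_ip"
      return 0
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "TIMEOUT: $host did not resolve to $expected_ip within ${timeout}s"
  return 1
}

# e.g. wait_for_dns dev.gitlab.org 34.139.135.192
```

Note that `getent` reflects the local resolver cache, so a stale answer may persist until the lowered TTL from the earlier step expires.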

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 30

  • Remove alert silences
  • Check for new alerts
  • Unpause auto-deploys and make sure the next pipeline succeeds - /chatops run auto_deploy unpause
  • Validate repo mirroring is working - /chatops run mirror status
  • Announce that this change is complete on https://gitlab.slack.com/archives/C0259241C/p1643278197350200

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 180

  • ssh onto the new dev.gitlab.org node and make sure all services are stopped on it
ssh 34.139.135.192
sudo su -
chef-client-disable 'gl-infra/production/-/issues/6236'
gitlab-ctl stop
gitlab-ctl status
  • ssh onto the old dev.gitlab.org node and stop all services running on the machine
ssh 20.122.91.77
sudo su -
chef-client-disable 'gl-infra/production/-/issues/6236'
gitlab-ctl stop
gitlab-ctl status
  • Perform a sync of /var/opt/gitlab from the new dev node to the old dev node, removing the PostgreSQL data first

sudo -E time rsync --rsync-path="sudo rsync" -av --delete --exclude 'lost+found' --exclude gitlab-rails/shared/artifacts/75/59/7559ca4a957c8c82ba04781cd66a68d6022229fca0e8e88d8e487c96ee4446d0/2018* /var/opt/gitlab/ $USER@20.122.91.77:/var/opt/gitlab/ 
  • Perform a DB reindex on the old node
    • this might take 30m or longer - if you see statement timeouts try again with increased statement timeout at the beginning of /var/tmp/indexing_cmds.sql
    # as root
    gitlab-ctl start postgresql
    gitlab-ctl status
    gitlab-psql -a -f /var/tmp/indexing_cmds.sql
  • Start services on the old dev.gitlab.org node
gitlab-ctl start
gitlab-ctl status
  • Revert MR swapping IP address for dev.gitlab.org
  • Validate we see the old IP address, this may take up to 60 seconds to propagate AFTER the above MR is applied
    • dig dev.gitlab.org - we should have a response that contains our old IP address 20.122.91.77
  • Validate the old instance is usable
  • Merge MR TBD to lower Kubernetes HPA to our standard configuration
  • Complete Post Change Steps

Monitoring

Key metrics to observe

n/a - we do not keep specific metrics on this instance that would be helpful; validation is built into the procedure itself

Monitor for Errors in Sentry: https://sentry.gitlab.net/gitlab/devgitlaborg/?query=server_name%3A%22dev-1-01-sv-dev-1%22

Monitor Host stats: https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-env=ops&var-node=dev-1-01-sv-dev-1.c.gitlab-dev-1.internal

Summary of infrastructure changes

  • Does this change introduce new compute instances? No
  • Does this change re-size any existing compute instances? No
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No

Change Reviewer checklist

C4 C3 C2 C1:

  • The scheduled day and time of execution of the change is appropriate.
  • The change plan is technically accurate.
  • The change plan includes estimated timing values based on previous testing.
  • The change plan includes a viable rollback plan.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  • The change plan includes success measures for all steps/milestones during the execution.
  • The change adequately minimizes risk within the environment/service.
  • The performance implications of executing the change are well-understood and documented.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change. - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  • The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.
Edited by Graeme Gillies