# Cutover dev.gitlab.org from Azure to GCP

**Production Change**

## Change Summary

Performs the cutover of dev.gitlab.org from running on Azure to running on GCP.

For epic &389 (closed)
## Change Details

- **Services Impacted** - ServiceUnknown
- **Change Technician** - @ggillies
- **Change Reviewer** - @hphilipps
- **Time tracking** - 240 minutes
- **Downtime Component** - YES
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 5 mins

- [ ] Set label ~"change::in-progress" on this issue
- [ ] Announce that this change is starting on https://gitlab.slack.com/archives/C0259241C/p1643278197350200
- [ ] Confirm Omnibus nightly builds are inactive: https://dev.gitlab.org/gitlab/omnibus-gitlab/-/pipeline_schedules
- [ ] Confirm Auto-Deploy is paused: `/chatops run auto_deploy pause`
- [ ] Confirm there are no running deployments
- [ ] Set a silence on alerts for dev.gitlab.org (match `fqdn="dev.gitlab.org"`): https://alerts.gitlab.net/#/silences/new
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 200

- [ ] Merge and apply MR for locking the HPAs
- [ ] Merge and apply MR for decreasing the DNS TTL of dev.gitlab.org
- [ ] ssh onto the new dev.gitlab.org node and make sure all services are stopped on it

```shell
ssh 34.139.135.192
sudo su -
chef-client-disable 'gl-infra/production/-/issues/6236'
gitlab-ctl stop
gitlab-ctl status
```
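Before moving on, it can help to assert that nothing is left running rather than eyeballing the status output. A minimal sketch, assuming `gitlab-ctl status` prints one `run:`/`down:` line per service; the `status` function here is a stand-in so the sketch is self-contained - on the node, replace its body with the real command:

```shell
# Stand-in for `gitlab-ctl status`; replace the body with the real command on the node.
status() { printf 'down: gitaly: 2s, normally up\ndown: nginx: 2s, normally up\n'; }

# Any line still starting with "run:" means a service was not stopped.
if status | grep -q '^run:'; then
  echo "some services are still running - do not proceed" >&2
else
  echo "all services stopped"
fi
```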
- [ ] ssh onto the old dev.gitlab.org node and stop all services running on the machine

```shell
ssh dev.gitlab.org
sudo su -
chef-client-disable 'gl-infra/production/-/issues/6236'
gitlab-ctl stop
gitlab-ctl status
```
- [ ] Do a test DB dump on the source machine to validate there is no DB corruption at the source

```shell
time sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_dump -h /var/opt/gitlab/postgresql -d gitlabhq_production --jobs 32 -Fd -f /tmp/parallel_dump/
```
- [ ] Perform a final and complete sync of /var/opt/gitlab from the old dev node to the new dev node, removing the PostgreSQL data first

```shell
sudo su -
rm -rf /var/opt/gitlab/postgresql/data
exit
sudo -E time rsync --rsync-path="sudo rsync" -av --delete --exclude 'lost+found' --exclude gitlab-rails/shared/artifacts/75/59/7559ca4a957c8c82ba04781cd66a68d6022229fca0e8e88d8e487c96ee4446d0/2018* $USER@dev.gitlab.org:/var/opt/gitlab/ /var/opt/gitlab/
```
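One way to gain confidence that the sync completed is to run rsync a second time with `--dry-run --itemize-changes`: on an already in-sync tree it prints nothing. A local sketch using temp directories (the real invocation would reuse the rsync command above, flags and excludes included):

```shell
# Local demonstration of verifying an rsync by re-running it as a dry run.
src=$(mktemp -d); dst=$(mktemp -d)
echo 'repo data' > "$src/file"

# First pass: the real copy (stands in for the cutover rsync).
rsync -a --delete "$src/" "$dst/"

# Second pass: dry run; any itemized output means the trees still differ.
pending=$(rsync -a --delete --dry-run --itemize-changes "$src/" "$dst/" | wc -l)
echo "pending changes: $pending"
rm -rf "$src" "$dst"
```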
- [ ] Perform a DB reindex on the new node (needed because of collation changes in Ubuntu 20.04, see comment). This might take 30m or longer; if you see statement timeouts, try again with an increased statement timeout at the beginning of /var/tmp/indexing_cmds.sql

```shell
# as root
gitlab-ctl start postgresql
gitlab-ctl status
gitlab-psql -a -f /var/tmp/indexing_cmds.sql
```
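If statements do time out, the timeout can be raised by prepending a `SET statement_timeout` to the SQL file before re-running it. A hedged sketch, demonstrated on a temp copy (the `60min` value is an example, not from the runbook; the real target is /var/tmp/indexing_cmds.sql):

```shell
# Demonstrate on a temp file; on the node the target is /var/tmp/indexing_cmds.sql.
f=$(mktemp)
printf 'REINDEX DATABASE gitlabhq_production;\n' > "$f"   # stand-in file contents

# Prepend a session-level timeout; a value of 0 would disable the timeout entirely.
sed -i "1i SET statement_timeout = '60min';" "$f"
first=$(head -n 1 "$f")
echo "$first"
rm -f "$f"
```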
- [ ] Perform a DB integrity check - we want an exit code of 0. Any error is a sign that the rsync is not complete, or that we need to consider rolling back

```shell
time sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_dump -h /var/opt/gitlab/postgresql -d gitlabhq_production --jobs 32 -Fd -f /tmp/parallel_dump/
```
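Since the dump's exit code is the pass/fail signal here, the check can be scripted so a nonzero status is impossible to miss. A sketch where `run_dump` is a stand-in for the pg_dump invocation above:

```shell
# `run_dump` stands in for the real pg_dump command above.
run_dump() { true; }

# Gate on the exit code: 0 means the dump (and therefore the data) is readable.
if run_dump; then
  echo "integrity check passed (exit 0)"
else
  echo "integrity check FAILED - rsync may be incomplete; consider rollback" >&2
fi
```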
- [ ] Start services on the new dev.gitlab.org node

```shell
gitlab-ctl start
gitlab-ctl status
```
- [ ] Validate the GitLab service is accessible; use curl, following the redirects a browser would
  - [ ] `curl -v http://34.139.135.192` - we should see a 3xx status for the redirect to https://dev.gitlab.org
  - [ ] `curl -vk https://34.139.135.192` - we should see a 3xx status for the redirect to https://34.139.135.192/users/sign_in
  - [ ] `ssh git@34.139.135.192` - this should print a success message, for example: `Welcome to GitLab, $USER!`
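The curl checks can also be scripted by comparing the returned status code, so a non-3xx result is flagged immediately. A sketch where `probe` is a stand-in so it runs anywhere; the real body would be `curl -s -o /dev/null -w '%{http_code}' http://34.139.135.192`:

```shell
# Stand-in for: curl -s -o /dev/null -w '%{http_code}' http://34.139.135.192
probe() { echo 301; }

code=$(probe)
case "$code" in
  3??) echo "redirect OK ($code)" ;;
  *)   echo "unexpected status $code - investigate before swapping DNS" >&2 ;;
esac
```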
- [ ] Merge MR swapping the IP address for dev.gitlab.org
- [ ] Validate we see the new IP address; this may take up to 60 seconds to propagate AFTER the above MR is applied
  - [ ] `dig dev.gitlab.org` - we should get a response that contains our new IP address, 34.139.135.192
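Propagation can be watched with a small polling loop instead of re-running dig by hand. A sketch where `lookup` is a stand-in that returns the expected value so the sketch is self-contained; on a real host its body would be `dig +short dev.gitlab.org`:

```shell
expected="34.139.135.192"
# Stand-in for: dig +short dev.gitlab.org
lookup() { echo "$expected"; }

# Poll up to 12 times, 5s apart (roughly the 60s window noted above).
for attempt in 1 2 3 4 5 6 7 8 9 10 11 12; do
  if lookup | grep -qx "$expected"; then
    echo "DNS now resolves to $expected"
    break
  fi
  sleep 5   # the TTL was lowered earlier, so this should converge quickly
done
```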
- [ ] Validate the instance is usable
  - [ ] Open a browser pointed to dev.gitlab.org
  - [ ] Run a pipeline as a sanity check, for example: https://dev.gitlab.org/skarbek/test0/-/pipelines/new
- [ ] Revert MR to lower the Kubernetes HPA to our standard configuration
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 30

- [ ] Remove alert silences
- [ ] Check for new alerts
- [ ] Unpause auto-deploys and make sure the next pipeline succeeds: `/chatops run auto_deploy unpause`
- [ ] Validate repo mirroring is working: `/chatops run mirror status`
- [ ] Announce that this change is complete on https://gitlab.slack.com/archives/C0259241C/p1643278197350200
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 180

- [ ] ssh onto the new dev.gitlab.org node and make sure all services are stopped on it

```shell
ssh 34.139.135.192
sudo su -
chef-client-disable 'gl-infra/production/-/issues/6236'
gitlab-ctl stop
gitlab-ctl status
```
- [ ] ssh onto the old dev.gitlab.org node and stop all services running on the machine

```shell
ssh 20.122.91.77
sudo su -
chef-client-disable 'gl-infra/production/-/issues/6236'
gitlab-ctl stop
gitlab-ctl status
```
- [ ] Perform a sync of /var/opt/gitlab from the new dev node to the old dev node, but first remove all PostgreSQL data

```shell
sudo -E time rsync --rsync-path="sudo rsync" -av --delete --exclude 'lost+found' --exclude gitlab-rails/shared/artifacts/75/59/7559ca4a957c8c82ba04781cd66a68d6022229fca0e8e88d8e487c96ee4446d0/2018* /var/opt/gitlab/ $USER@20.122.91.77:/var/opt/gitlab/
```
- [ ] Perform a DB reindex on the old node. This might take 30m or longer; if you see statement timeouts, try again with an increased statement timeout at the beginning of /var/tmp/indexing_cmds.sql

```shell
# as root
gitlab-ctl start postgresql
gitlab-ctl status
gitlab-psql -a -f /var/tmp/indexing_cmds.sql
```

- [ ] Start services on the old dev.gitlab.org node

```shell
gitlab-ctl start
gitlab-ctl status
```
- [ ] Revert MR swapping the IP address for dev.gitlab.org
- [ ] Validate we see the old IP address; this may take up to 60 seconds to propagate AFTER the above MR is applied
  - [ ] `dig dev.gitlab.org` - we should get a response that contains our old IP address, 20.122.91.77
- [ ] Validate the old instance is usable
  - [ ] Open a browser pointed to dev.gitlab.org
  - [ ] `ssh git@dev.gitlab.org` - this should print a success message, for example: `Welcome to GitLab, $USER!`
  - [ ] Run a pipeline as a sanity check, for example: https://dev.gitlab.org/skarbek/test0/-/pipelines/new
- [ ] Merge MR TBD to lower the Kubernetes HPA to our standard configuration
- [ ] Complete the Post-Change Steps
## Monitoring

### Key metrics to observe

- n/a - we do not keep specific metrics on this instance that would be helpful; testing throughout the procedure is built in
- Monitor for errors in Sentry: https://sentry.gitlab.net/gitlab/devgitlaborg/?query=server_name%3A%22dev-1-01-sv-dev-1%22
- Monitor host stats: https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-env=ops&var-node=dev-1-01-sv-dev-1.c.gitlab-dev-1.internal
## Summary of infrastructure changes

- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
## Change Reviewer checklist

- [ ] The scheduled day and time of execution of the change is appropriate.
- [ ] The change plan is technically accurate.
- [ ] The change plan includes estimated timing values based on previous testing.
- [ ] The change plan includes a viable rollback plan.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change.

- [ ] The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- [ ] The change plan includes success measures for all steps/milestones during the execution.
- [ ] The change adequately minimizes risk within the environment/service.
- [ ] The performance implications of executing the change are well-understood and documented.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- [ ] The change has a primary and secondary SRE with knowledge of the details available during the change window.
## Change Technician checklist

- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed! cases include DB changes) prior to change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.
Edited by Graeme Gillies