2021-09-23: Upgrade OS on Ops to 20.04
Production Change
Change Summary
This change will upgrade ops.gitlab.net to Ubuntu 20.04. This is the final piece of &578 (closed).
Change Details
- Services Impacted - Service::Web
- Change Technician - @ahanselka
- Change Reviewer - @T4cC0re
- Time tracking (mins) - 240
- Downtime Component (mins) - 45
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5
- Create the appropriate silences:
  - https://alerts.gitlab.net/#/silences/new?filter=%7Balert_type%3D%22symptom%22%2C%20env%3D%22gprd%22%2C%20environment%3D%22gprd%22%2C%20instance%3D%22https://ops.gitlab.net/users/sign_in%22%2C%20job%3D%22blackbox%22%2C%20monitor%3D%22default%22%2C%20pager%3D%22pagerduty%22%2C%20provider%3D%22gcp%22%2C%20region%3D%22us-east%22%2C%20severity%3D%22s1%22%2C%20shard%3D%22default%22%2C%20stage%3D%22main%22%2C%20tier%3D%22sv%22%2C%20type%3D%22blackbox%22%2C%20alertname%3D%22BlackboxProbeFailuresLong%22%7D
  - https://alerts.gitlab.net/#/silences/new?filter=%7Baggregation%3D%22component%22%2C%20alert_type%3D%22symptom%22%2C%20component%3D%22polling%22%2C%20env%3D%22ops%22%2C%20environment%3D%22ops%22%2C%20feature_category%3D%22runner%22%2C%20monitor%3D%22global%22%2C%20pager%3D%22pagerduty%22%2C%20product_stage%3D%22verify%22%2C%20product_stage_group%3D%22runner%22%2C%20rules_domain%3D%22general%22%2C%20severity%3D%22s2%22%2C%20sli_type%3D%22error%22%2C%20slo_alert%3D%22yes%22%2C%20stage%3D%22main%22%2C%20tier%3D%22runners%22%2C%20type%3D%22ci-runners%22%2C%20user_impacting%3D%22yes%22%2C%20alertname%3D%22CiRunnersServicePollingErrorSLOViolation%22%7D
  - https://alerts.gitlab.net/#/silences/new?filter=%7Baggregation%3D%22component%22%2C%20alert_type%3D%22symptom%22%2C%20component%3D%22gitlab_net_zone%22%2C%20env%3D%22gprd%22%2C%20environment%3D%22gprd%22%2C%20feature_category%3D%22not_owned%22%2C%20monitor%3D%22global%22%2C%20pager%3D%22pagerduty%22%2C%20product_stage%3D%22not_owned%22%2C%20product_stage_group%3D%22not_owned%22%2C%20rules_domain%3D%22general%22%2C%20severity%3D%22s2%22%2C%20sli_type%3D%22error%22%2C%20slo_alert%3D%22yes%22%2C%20stage%3D%22main%22%2C%20tier%3D%22lb%22%2C%20type%3D%22waf%22%2C%20user_impacting%3D%22no%22%2C%20alertname%3D%22WafServiceGitlabNetZoneErrorSLOViolation%22%7D
  - https://alerts.gitlab.net/#/silences/new?filter=%7Balert_type%3D%22cause%22%2C%20env%3D%22ops%22%2C%20environment%3D%22ops%22%2C%20geo_role%3D%22primary%22%2C%20job%3D%22gitlab-redis%22%2C%20monitor%3D%22default%22%2C%20pager%3D%22pagerduty%22%2C%20provider%3D%22gcp%22%2C%20region%3D%22us-east%22%2C%20severity%3D%22s1%22%2C%20shard%3D%22default%22%2C%20stage%3D%22main%22%2C%20tier%3D%22inf%22%2C%20type%3D%22gitlab%22%2C%20alertname%3D%22RedisMasterMissing%22%7D
  - https://alerts.gitlab.net/#/silences/new?filter=%7Balert_type%3D%22symptom%22%2C%20env%3D%22gprd%22%2C%20environment%3D%22gprd%22%2C%20instance%3D%22https://registry.ops.gitlab.net%22%2C%20job%3D%22blackbox%22%2C%20monitor%3D%22default%22%2C%20pager%3D%22pagerduty%22%2C%20provider%3D%22gcp%22%2C%20region%3D%22us-east%22%2C%20severity%3D%22s1%22%2C%20shard%3D%22default%22%2C%20stage%3D%22main%22%2C%20tier%3D%22sv%22%2C%20type%3D%22blackbox%22%2C%20alertname%3D%22BlackboxProbeFailures%22%7D
  - https://alerts.gitlab.net/#/silences/new?filter=%7Balert_type%3D%22cause%22%2C%20env%3D%22ops%22%2C%20environment%3D%22ops%22%2C%20fqdn%3D%22gitlab-01-inf-ops.c.gitlab-ops.internal%22%2C%20geo_role%3D%22primary%22%2C%20instance%3D%22gitlab-01-inf-ops.c.gitlab-ops.internal:9121%22%2C%20job%3D%22gitlab-redis%22%2C%20monitor%3D%22default%22%2C%20pager%3D%22pagerduty%22%2C%20provider%3D%22gcp%22%2C%20region%3D%22us-east%22%2C%20severity%3D%22s1%22%2C%20shard%3D%22default%22%2C%20stage%3D%22main%22%2C%20tier%3D%22inf%22%2C%20type%3D%22gitlab%22%2C%20alertname%3D%22RedisInstanceDown%22%7D
  - https://alerts.gitlab.net/#/silences/new?filter=%7Baggregation%3D%22component%22%2C%20alert_class%3D%22traffic_cessation%22%2C%20alert_type%3D%22cause%22%2C%20component%3D%22polling%22%2C%20env%3D%22ops%22%2C%20environment%3D%22ops%22%2C%20feature_category%3D%22runner%22%2C%20monitor%3D%22global%22%2C%20pager%3D%22pagerduty%22%2C%20product_stage%3D%22verify%22%2C%20product_stage_group%3D%22runner%22%2C%20rules_domain%3D%22general%22%2C%20severity%3D%22s2%22%2C%20sli_type%3D%22ops%22%2C%20slo_alert%3D%22no%22%2C%20stage%3D%22main%22%2C%20tier%3D%22runners%22%2C%20type%3D%22ci-runners%22%2C%20user_impacting%3D%22yes%22%2C%20alertname%3D%22CiRunnersServicePollingTrafficAbsent%22%7D
  - https://alerts.gitlab.net/#/silences/new?filter=%7Balert_type%3D%22cause%22%2C%20alertname%3D%22ChefClientErrorCritical%22%2C%20env%3D%22gstg%22%2C%20monitor%3D%22default%22%2C%20pager%3D%22pagerduty%22%2C%20provider%3D%22gcp%22%2C%20region%3D%22us-east%22%2C%20severity%3D%22s1%22%2C%20type%3D%22pgbouncer%22%7D
- Announce the downtime in #infrastructure-lounge and #releases. Ping @prod-eng-team and @release-managers.
- Set label change::in-progress on this issue.
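The silences above are created through the Alertmanager web UI. As an alternative, a minimal sketch of scripting one of them with `amtool` follows. It assumes `amtool` is installed and that https://alerts.gitlab.net accepts API requests at that address; the `silence_alert` helper, the one-hour duration, and the comment text are hypothetical and not part of the original plan. By default the helper only prints the command it would run.

```shell
#!/bin/sh
# Sketch only: build an `amtool silence add` command for one alert.
# Assumptions: amtool is on PATH and the URL below is the Alertmanager endpoint.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-https://alerts.gitlab.net}"
SILENCE_DURATION="${SILENCE_DURATION:-1h}"   # hypothetical duration

# silence_alert ALERTNAME [MATCHER...]
# Prints the command when DRY_RUN=1 (the default); runs it when DRY_RUN=0.
silence_alert() {
    alertname="$1"
    shift
    set -- amtool silence add \
        --alertmanager.url="$ALERTMANAGER_URL" \
        --duration="$SILENCE_DURATION" \
        --comment=ops-ubuntu-20.04-upgrade \
        "alertname=$alertname" "$@"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example: two of the matchers from the RedisMasterMissing silence link.
silence_alert RedisMasterMissing env=ops job=gitlab-redis
```

Keeping the dry-run default means the command can be reviewed in the issue before anyone creates a silence.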
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 45
- Create a snapshot of the data disk: gcloud compute disks snapshot gitlab-01-inf-ops-data --project=gitlab-ops --snapshot-names=gitlab-01-inf-ops-data-ubuntu-upgrade-backup --zone=us-east1-c --storage-location=us
- Create an image of the boot disk: gcloud compute images create gitlab-ops-16-04-backup --source-disk=gitlab-01-inf-ops --project=gitlab-ops --source-disk-zone=us-east1-c --storage-location=us
- Mark the ops VM for rebuild: tf taint "module.gitlab-ops.google_compute_instance.default[0]"
- Apply the change: tf apply
- Wait for the node to get back online. Verify that it's running correctly by trying to open a gitlab-rails console.
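The waiting in these steps can be made explicit with a small polling helper. The sketch below is an illustration, not part of the original procedure: `wait_until`, `snapshot_ready`, and `sign_in_reachable` are hypothetical names, and the retry counts are arbitrary. It checks that the data-disk snapshot reports `READY` before the VM is tainted, and afterwards polls the sign-in page that the blackbox probe also monitors. A successful HTTP response does not replace the gitlab-rails console check.

```shell
#!/bin/sh
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or ATTEMPTS is exhausted.
wait_until() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        # Only sleep when another attempt will follow.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Succeeds once the backup snapshot from the first change step is READY.
snapshot_ready() {
    status="$(gcloud compute snapshots describe \
        gitlab-01-inf-ops-data-ubuntu-upgrade-backup \
        --project=gitlab-ops --format='value(status)')"
    [ "$status" = "READY" ]
}

# Succeeds once the sign-in page answers with a non-error HTTP status.
sign_in_reachable() {
    curl --silent --fail --output /dev/null https://ops.gitlab.net/users/sign_in
}

# Usage (left commented out so that sourcing this file has no side effects):
#   wait_until 30 10 snapshot_ready     # before `tf taint`
#   wait_until 60 30 sign_in_reachable  # after `tf apply`
```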
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5
- Expire the silences created earlier.
- Delete the disk snapshot created earlier.
- Delete the disk image created earlier.
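The two deletions can be scripted against the resource names used in the change steps. This is a sketch under assumptions: the `run` and `cleanup_backups` helpers are hypothetical, and `--quiet` is added only to skip the interactive confirmation prompt. With the default `DRY_RUN=1` the commands are printed rather than executed.

```shell
#!/bin/sh
# run COMMAND [ARGS...]: print the command when DRY_RUN=1 (default), else run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove the backups taken before the rebuild. Only do this once the
# upgraded instance has been verified, because rollback depends on them.
cleanup_backups() {
    run gcloud compute snapshots delete \
        gitlab-01-inf-ops-data-ubuntu-upgrade-backup \
        --project=gitlab-ops --quiet
    run gcloud compute images delete gitlab-ops-16-04-backup \
        --project=gitlab-ops --quiet
}

cleanup_backups
```

Setting `DRY_RUN=0` performs the deletions.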
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 45
- Change the image of the ops instance to the backup boot-disk image created in the change steps.
- Mark the ops VM for rebuild: tf taint "module.gitlab-ops.google_compute_instance.default[0]"
- Apply the change: tf apply
- Wait for the node to get back online. Verify that it's running correctly by trying to open a gitlab-rails console.
Monitoring
Key metrics to observe
- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? No
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Alex Hanselka