Change definition of google-project for srm to prepare for override
Production Change
Change Summary
In #4028 (closed) we are going to change srm7 to point to a different GCP project in which to create ephemeral VMs. Our chef roles require a change to where we define google-project so that it can be overridden on a per-host basis.
Change Details
- Services Impacted - ServiceCI Runners
- Change Technician - @steveazz
- Change Criticality - C3
- Change Type - changescheduled
- Change Reviewer - @tmaczukin
- Due Date - 2021-03-29 06:15 UTC
- Time tracking - 10min
- Downtime Component - 0
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1
- Get https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5250 reviewed: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5250#note_101799
- Disable chef-client: knife ssh -C2 -afqdn 'roles:gitlab-runner-srm' -- 'sudo -i chef-client-disable "change-management: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4061"'
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5
- Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5250
- Run the apply_to_prod job
- Run chef on the staging nodes: knife ssh -C2 -afqdn 'roles:gitlab-runner-stg-srm' -- 'sudo -i chef-client'
- Verify that the project is still defined: knife ssh -afqdn 'roles:gitlab-runner-stg-srm' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-project=[^,]*"'
- Verify that the subnetwork is still defined: knife ssh -afqdn 'roles:gitlab-runner-stg-srm' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-subnetwork=[^,]*"'
- Run chef on every node: knife ssh -C2 -afqdn 'roles:gitlab-runner-srm' -- 'sudo -i chef-client'
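The two verification commands above rely on a grep pattern that pulls a single docker-machine option out of config.toml. The following is a minimal local sketch of what that pattern matches. The helper name machine_option and the sample config contents are hypothetical, and plain grep -o is used instead of grep -Po because this expression needs no Perl-only syntax.

```shell
#!/bin/sh
# Print every occurrence of a docker-machine option in a runner config file.
# $1 = path to a config.toml, $2 = option name (e.g. google-project).
# machine_option is a hypothetical helper, not part of gitlab-runner or knife.
machine_option() {
  grep -o "\"$2=[^,]*" "$1"
}

# Hypothetical sample config; a real config.toml holds many more settings.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
[[runners]]
  [runners.machine]
    MachineOptions = ["google-project=example-project", "google-subnetwork=example-subnet", "google-zone=us-east1-c"]
EOF

machine_option "$sample" google-project     # prints "google-project=example-project"
machine_option "$sample" google-subnetwork  # prints "google-subnetwork=example-subnet"
```

The match stops at the first comma, so an option that is last in the list would also capture the trailing bracket. That is harmless for a visual check but matters if the output is compared as a string.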
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 2
- Verify that the project is still defined: knife ssh -C2 -afqdn 'roles:gitlab-runner-srm' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-project=[^,]*"'
- Verify that the subnetwork is still defined: knife ssh -afqdn 'roles:gitlab-runner-stg-srm' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-subnetwork=[^,]*"'
- Verify configuration reloaded: knife ssh -C2 -afqdn 'roles:gitlab-runner-srm' -- 'sudo -i journalctl -u gitlab-runner | grep "Configuration loaded"'
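When the post-change check runs across many nodes, comparing the values by eye is error-prone. Below is a rough sketch of one way to flag nodes whose value differs from the expected one. It assumes the captured output has one line per host in the form "host value", which may not match the real knife ssh output exactly. The function name find_mismatches and the host names are hypothetical.

```shell
#!/bin/sh
# Read "host value" lines on stdin and print each host whose value differs
# from the expected value in $1. Exits 0 only when every line matches.
# find_mismatches is a hypothetical helper; the input format is an assumption.
find_mismatches() {
  awk -v want="$1" '
    $2 != want { print $1; bad = 1 }
    END { exit bad + 0 }
  '
}

# Hypothetical captured output from the verification step.
printf '%s\n' \
  'runner-a.example.com "google-project=example-project"' \
  'runner-b.example.com "google-project=example-project"' |
  find_mismatches '"google-project=example-project"' &&
  echo "all nodes agree"
```

A node where grep finds nothing produces no line at all, so this sketch cannot detect a missing option. Comparing the number of lines against the expected node count would cover that case.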
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5250 and run apply_to_prod
- Force chef-client: knife ssh -C2 -afqdn 'roles:gitlab-runner-srm' -- 'sudo -i chef-client'
Monitoring
Key metrics to observe
- Metric: Apdex
- Location: https://dashboards.gitlab.net/d/ci-runners-main/ci-runners-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2&viewPanel=79474957
- What changes to this metric should prompt a rollback: Dropping
- Metric: Autoscale machines
- Location: https://dashboards.gitlab.net/d/000000159/ci?viewPanel=3&orgId=1&var-shard=shared&var-runner_type=All&var-runner_managers=All&var-gitlab_env=gprd&var-gl_monitor_fqdn=All&var-has_minutes=yes&var-runner_job_failure_reason=All&var-jobs_running_for_project=0&var-runner_request_endpoint_status=All
- What changes to this metric should prompt a rollback: No "used" or "creating" machines
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Steve Xuereb