Update shared runners manager GCP label configuration
Production Change
Change Summary
As part of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11391 we are rolling out a new version of the runner configuration file. Its purpose is to help with cost tracking by labeling the ephemeral GCP instances created to run builds.
Change Details
- Services Impacted - shared-runners-manager-*
- Change Technician - @steveazz
- Change Criticality - C2
- Change Type - changeunscheduled
- Change Reviewer - @steveazz
- Due Date - 2020-10-27 15:30 UTC
- Time tracking - 30
- Downtime Component - no downtime
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 20
- Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4457
- Wait for the master pipeline to finish
- Run the manual apply_to_prod job
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 10
- Run chef-client on affected roles: knife ssh -C 2 -afqdn roles:gitlab-runner-srm -- 'sudo -i chef-client'
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 20
- Check that the google-label options were added: knife ssh -C 2 -afqdn roles:gitlab-runner-srm -- 'sudo cat /etc/gitlab-runner/config.toml | grep -oP "\"google-label=[a-zA-Z_\-:0-9]+\""'
- Confirm that the output includes the new labels: google-label=gl_resource_type:ci_ephemeral, google-label=runner_manager_group:shared-runners-manager, google-label=runner_manager_name:shared-runners-manager-3
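The confirmation step above can be scripted rather than checked by eye. The following is a minimal sketch, not part of the original change plan: check_labels is a hypothetical helper name, and it assumes the config.toml content (or the grep output from the previous step) for a single manager is piped in on stdin. It only checks that each of the three label keys listed above is present, not what their values are.

```shell
# Hypothetical helper (not part of the change plan): reads text on stdin and
# reports any of the three expected google-label keys that are absent.
check_labels() {
  cl_config=$(cat)   # capture stdin so it can be searched once per key
  cl_missing=0
  for cl_key in gl_resource_type runner_manager_group runner_manager_name; do
    if ! printf '%s\n' "$cl_config" | grep -q "google-label=${cl_key}:"; then
      echo "missing label: ${cl_key}"
      cl_missing=1
    fi
  done
  # Exit status 0 when all keys were found, 1 otherwise.
  return "$cl_missing"
}
```

One possible use, run against one manager at a time so that a missing label on one host is not masked by another host's output: ssh shared-runners-manager-3.gitlab.com 'sudo cat /etc/gitlab-runner/config.toml' | check_labels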
Rollback
Rollback steps - steps to be taken if this change needs to be rolled back
Estimated Time to Complete (mins) - 15
- Revert the merge request https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4457
- Wait for the master pipeline to finish
- Trigger the apply_to_prod job
- Run chef-client on affected roles: knife ssh -C 2 -afqdn roles:gitlab-runner-srm -- 'sudo -i chef-client'
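The rollback steps do not include a verification step. As a rough sketch of one, the helper below counts lines that still carry a google-label option. count_labels is a hypothetical name, and the check assumes config.toml contained no google-label options before this change, so a count of 0 on a manager would suggest the rollback has been applied there.

```shell
# Hypothetical helper (not part of the change plan): reads text on stdin and
# prints the number of lines that contain a google-label option.
count_labels() {
  # grep -c prints the match count but exits non-zero when the count is 0;
  # "|| true" keeps that expected case from being treated as a failure.
  grep -c 'google-label=' || true
}
```

For example, on a single manager: sudo cat /etc/gitlab-runner/config.toml | count_labels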
Monitoring
Key metrics to observe
- Metric: docker-machine operations
- Location: https://dashboards.gitlab.net/d/000000159/ci?viewPanel=3&orgId=1&from=1601547174928&to=1601557974928&var-runner_type=All&var-runner_managers=All&var-gitlab_env=gprd&var-gl_monitor_fqdn=All&var-has_minutes=yes&var-runner_job_failure_reason=All&var-jobs_running_for_project=0&var-runner_request_endpoint_status=All
- What changes to this metric should prompt a rollback: No state changes at all, which would mean that the docker-machine binary is not working
- Metric: Error rates
- Location: https://dashboards.gitlab.net/d/000000159/ci?viewPanel=49&orgId=1&from=1601547283953&to=1601558083953&var-runner_type=All&var-runner_managers=All&var-gitlab_env=gprd&var-gl_monitor_fqdn=All&var-has_minutes=yes&var-runner_job_failure_reason=All&var-jobs_running_for_project=0&var-runner_request_endpoint_status=All
- What changes to this metric should prompt a rollback: Spike in the error value
- Metric: docker-machine logs
- Location: https://log.gprd.gitlab.net/goto/957eafd85b394a4581415ca09b5acd67
- What changes to this metric should prompt a rollback: Spike in error level logs.
Summary of infrastructure changes
The runner config.toml will be updated with new google-label=... docker-machine options.
Runner managers affected:
shared-runners-manager-3.gitlab.com
shared-runners-manager-3.staging.gitlab.com
shared-runners-manager-4.gitlab.com
shared-runners-manager-4.staging.gitlab.com
shared-runners-manager-5.gitlab.com
shared-runners-manager-6.gitlab.com
shared-runners-manager-7.gitlab.com
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.