Migrate docker-machine to use wait for machine create on shared-runners-manager-7
Production Change
Change Summary
As part of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13277, migrate to use wait when we call docker-machine create. To migrate to wait, we need to upgrade docker-machine to version v0.16.2-gitlab.12.
Change Details
- Services Impacted - Service::CI Runners
- Change Technician - @steveazz
- Change Criticality - C3
- Change Type - change::scheduled
- Change Reviewer - @igorwwwwwwwwwwwwwwwwwwww
- Due Date - 2020-05-06
- Time tracking - 10 minutes
- Downtime Component - 0 minutes
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5
- Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5432
- Run apply_to_prod manual job https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/jobs/3793224
- Force chef-client run: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i chef-client'
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5
- knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i docker-machine --version'
- knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sha256sum /usr/bin/docker-machine': Expected 0c9c659318fabe54cff460b6bb1d92f2a987dffd7de03d6ccc17d6449d4871f9
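The checksum comparison above can be done by eye, but with several hosts in the role a small filter may be less error-prone. The following is a minimal sketch, not part of the original change plan: the helper name check_hashes is hypothetical, and it assumes that each line of knife ssh output contains the host name followed by the sha256sum output. The exact output layout is an assumption and should be confirmed against a real run.

```shell
#!/usr/bin/env bash
# Sketch: flag hosts whose docker-machine checksum differs from the expected one.
# ASSUMPTION: every input line holds a host name plus the sha256sum output
# (hash, then path). check_hashes is a hypothetical helper, not an existing tool.

# check_hashes EXPECTED
# Reads lines on stdin, prints every line that does not contain EXPECTED,
# and returns non-zero if any line mismatches or if there was no input at all.
check_hashes() {
  local expected="$1" line seen=0 bad=0
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip blank lines so they are not reported as mismatches.
    [ -z "$line" ] && continue
    seen=1
    case "$line" in
      *"$expected"*) ;;  # checksum present on this line: OK
      *) printf 'MISMATCH: %s\n' "$line"; bad=1 ;;
    esac
  done
  [ "$seen" -eq 1 ] && [ "$bad" -eq 0 ]
}

# Usage (not executed here):
# knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sha256sum /usr/bin/docker-machine' \
#   | check_hashes 0c9c659318fabe54cff460b6bb1d92f2a987dffd7de03d6ccc17d6449d4871f9
```

Treating empty input as a failure is deliberate: if the knife search matches no nodes, the check should not silently pass.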
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10
- Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5432
- Run apply_to_prod manual job
- Force chef-client run: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i chef-client'
- knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i docker-machine --version'
- knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sha256sum /usr/bin/docker-machine': Expected 75522b4a816c81b130e7fb6f07121c1d5ea4165c4df5fbf05663eac88b797f02
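Because the upgrade and the rollback each expect a different checksum, it is easy to paste the wrong one under pressure. As a small sketch, the two values from this issue can be kept behind one lookup; the function name expected_sha256 and the state names "upgrade" and "rollback" are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: single source for the checksums listed in this change issue.
# expected_sha256 and its state names are hypothetical, for illustration only.

# expected_sha256 STATE
# Prints the /usr/bin/docker-machine checksum expected after STATE,
# where STATE is "upgrade" (post-change) or "rollback".
expected_sha256() {
  case "$1" in
    upgrade)  echo 0c9c659318fabe54cff460b6bb1d92f2a987dffd7de03d6ccc17d6449d4871f9 ;;
    rollback) echo 75522b4a816c81b130e7fb6f07121c1d5ea4165c4df5fbf05663eac88b797f02 ;;
    *) echo "usage: expected_sha256 upgrade|rollback" >&2; return 2 ;;
  esac
}

# Usage (not executed here): compare on one host after a rollback.
# [ "$(sha256sum /usr/bin/docker-machine | cut -d' ' -f1)" = "$(expected_sha256 rollback)" ]
```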
Monitoring
Key metrics to observe
- Metric: Quotas
- Location: https://console.cloud.google.com/apis/api/compute.googleapis.com/quotas?project=gitlab-ci-plan-free-7-7fe256
- What changes to this metric should prompt a rollback: Start hitting limits on Read requests, Heavy-weight read requests, and Operation read requests
- Metric: srm7 running jobs
- Location: https://dashboards.gitlab.net/d/ci-runners-incident-runner-manager/ci-runners-incident-support-runner-manager?viewPanel=26&orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=shared&var-runner_manager=All&var-jobs_running_for_project=0&var-runner_job_failure_reason=All&from=1620214140000&to=1620224999999
- What changes to this metric should prompt a rollback: Drop in running jobs
- Metric: runner errors
- Location: https://dashboards.gitlab.net/d/ci-runners-incident-runner-manager/ci-runners-incident-support-runner-manager?viewPanel=23&var-PROMETHEUS_DS=&var-environment=gprd&var-stage=&var-shard=All&var-runner_manager=shared-runners-manager-7.gitlab.com.*&var-jobs_running_for_project=All&var-runner_job_failure_reason=All&from=1620214620000&to=1620225479999&orgId=1
- What changes to this metric should prompt a rollback: Increase in runner_system_failure
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue. 👉 #4484 (comment 568476401)
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Steve Xuereb