Reduce number of Idle machines to 200 from 400
Production Change - Criticality 3 C3
| Change Objective | Describe the objective of the change |
|---|---|
| Change Type | ConfigurationChange |
| Services Impacted | GitLab.com CI services - Shared Runners |
| Change Team Members | @steveazz |
| Change Criticality | C3 |
| Change Reviewer or tested in staging | @tmaczukin will review chef changes https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3608 |
| Due Date | 2020-06-01 12:00 UTC |
| Time tracking | ~20 Minutes for chef-client to propagate and idle machines to be deleted. |
Context: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10151
Detailed steps for the change
- pre-conditions for execution of the step - how to verify it is safe to proceed
  - No production incidents ongoing
  - CI Apdex score is not below SLO
  - Merge request has been reviewed
- execution commands for the step - what to do
  - Merge: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3608
  - Run https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/jobs/1250877
  - Wait for chef-client to run on each host
- post-execution validation for the step - how to verify the step succeeded
  - Run `knife ssh -C 1 -afqdn 'roles:gitlab-runner-srm' -- 'sudo cat /etc/gitlab-runner/config.toml | grep "IdleCount"'`. Ignoring the `shared-runners-manager-3.staging.gitlab.com` and `shared-runners-manager-4.staging.gitlab.com` hosts, the IdleCount of each shared runner manager should be one of:
    - `IdleCount = 200`
    - `IdleCount = 70`
    - `IdleCount = 700`
- Note relevant graphs in Grafana to monitor the effect of the change, including how to identify that it has worked, or has caused undue negative effects
  - CI Apdex score: when this change happens, we shouldn't see any dip in the Apdex score.
  - Idle machines for CI: when this change happens, we should see the number of idle machines drop.
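The IdleCount check in the post-execution validation step could be scripted instead of read by eye. The following is a minimal sketch, not part of the original change plan: `check_idle_counts` is a hypothetical helper name, and it assumes each line of `knife ssh` output has the form `<host fqdn> IdleCount = <number>`. The source does not say which value belongs to which runner manager, so the sketch only checks that every non-staging host reports one of the three expected values.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not part of the change plan).
# Reads `knife ssh` output on stdin, where each line is assumed to look like:
#   <host fqdn> IdleCount = <number>
# Exits 0 if every non-ignored host reports an expected value, 1 otherwise.
check_idle_counts() {
  awk '
    BEGIN { bad = 0; seen = 0 }
    # Skip the two staging hosts that the validation step says to ignore.
    /shared-runners-manager-[34]\.staging\.gitlab\.com/ { next }
    /IdleCount/ {
      seen++
      n = $NF    # last field is the configured value
      if (n != 200 && n != 70 && n != 700) {
        print "unexpected: " $0
        bad = 1
      }
    }
    END {
      # Treat empty input as a failure so a broken query is not read as success.
      if (seen == 0) { print "no IdleCount lines found"; bad = 1 }
      exit bad
    }
  '
}
```

To use it, pipe the output of the `knife ssh` command from the validation step into `check_idle_counts` and check the exit status. A host still reporting the old value of 400 would be printed as unexpected.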
Rollback steps
Revert: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3608
Changes checklist
- Detailed steps and rollback steps have been filled prior to commencing work
- SRE on-call has been informed prior to change being rolled out
- There are currently no open issues labeled as ServiceMonitoring with severities of ~S1 or ~S2
Edited by Steve Xuereb