Reduce number of Idle machines to 200 from 400

Production Change - Criticality 3 C3

Change Objective: Reduce the number of idle machines from 400 to 200
Change Type: ConfigurationChange
Services Impacted: GitLab.com CI services - Shared Runners
Change Team Members: @steveazz
Change Criticality: C3
Change Reviewer or tested in staging: @tmaczukin will review chef changes https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3608
Due Date: 2020-06-01 12:00 UTC
Time tracking: ~20 minutes for chef-client to propagate and idle machines to be deleted.

Context: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10151

Detailed steps for the change

  1. pre-conditions for execution of the step - how to verify it is safe to proceed

  2. execution commands for the step - what to do

  3. post-execution validation for the step - how to verify the step succeeded

    • Run knife ssh -C 1 -afqdn 'roles:gitlab-runner-srm' -- 'sudo cat /etc/gitlab-runner/config.toml | grep "IdleCount"'. Ignoring the shared-runners-manager-3.staging.gitlab.com and shared-runners-manager-4.staging.gitlab.com hosts, the IdleCount of each shared runner manager should be:
    IdleCount = 200
      IdleCount = 70
      IdleCount = 700
  • Note relevant graphs in Grafana to monitor the effect of the change, including how to identify that it has worked, or has caused undue negative effects
    • CI Apdex score. When this change happens, we shouldn't see any dip in the Apdex score.
    • Idle machines for CI. When this change happens, we should see the number of idle machines drop.
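
The post-execution check above can also be scripted rather than read by eye. The following is a minimal sketch, not part of the original change plan. It assumes that knife ssh prefixes every output line with the host's FQDN followed by whitespace, and that a host may report more than one IdleCount line. The function names parse_idle_counts and hosts_missing_value are hypothetical helpers introduced only for illustration.

```python
import re

# Staging hosts that the verification step says to ignore.
IGNORED_HOSTS = {
    "shared-runners-manager-3.staging.gitlab.com",
    "shared-runners-manager-4.staging.gitlab.com",
}

# Assumed line shape (not confirmed by the issue):
#   "<fqdn><whitespace>IdleCount = <number>"
LINE_RE = re.compile(r"^(?P<host>\S+)\s+IdleCount\s*=\s*(?P<count>\d+)\s*$")


def parse_idle_counts(output, ignored_hosts=IGNORED_HOSTS):
    """Return {host: [IdleCount, ...]} for every non-ignored host in the output."""
    counts = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        # Skip lines that do not match the assumed shape, and ignored hosts.
        if match is None or match.group("host") in ignored_hosts:
            continue
        counts.setdefault(match.group("host"), []).append(int(match.group("count")))
    return counts


def hosts_missing_value(counts, expected):
    """Return the sorted hosts whose IdleCount values do not include `expected`."""
    return sorted(host for host, values in counts.items() if expected not in values)
```

To use it, save the knife ssh output to a file, pass its contents to parse_idle_counts, and call hosts_missing_value(counts, 200). An empty list means every non-ignored host reports an IdleCount of 200. A non-empty list names the hosts where chef-client may not have applied the change yet.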

Rollback steps

Revert: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3608

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • SRE on-call has been informed prior to change being rolled out
  • There are currently no open issues labeled as ServiceMonitoring with severities of ~S1 or ~S2
Edited by Steve Xuereb