Increase concurrent to 900 for shared-runners-manager-7
Production Change
Change Summary
Increase the number of concurrent jobs to 900 for shared-runners-manager-7 as part of increasing the shared runner capacity in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13277
Change Details
- Services Impacted - Service::CI Runners
- Change Technician - @steveazz
- Change Criticality - C4
- Change Type - change::scheduled
- Change Reviewer - @igorwwwwwwwwwwwwwwwwwwww
- Due Date - 2021-05-06
- Time tracking - 10 minutes
- Downtime Component - 0 minutes
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 0
- Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5
- Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5436
- Run apply_to_prod on merge pipeline https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/jobs/3793515
- Force chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i chef-client'
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 1
- Check concurrent value: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo grep "concurrent" /etc/gitlab-runner/config.toml'
  - Expected value: 900
- Check limit value: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo grep "limit" /etc/gitlab-runner/config.toml'
  - Expected value: 900
- Check IdleCount: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo grep "IdleCount" /etc/gitlab-runner/config.toml'
  - 3 hits: 800, 100, 800
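The three grep checks above can also be scripted so that the comparison against the expected values is not done by eye. The following is a minimal sketch and not part of the original change plan: setting_values and check_setting are hypothetical helper names, and the sketch assumes each setting appears in config.toml as a plain `key = <number>` line and that it is run on the runner manager host itself or against a local copy of the file.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for verifying numeric settings in a gitlab-runner
# config.toml. Assumes each setting is written as `key = <number>` on its own line.

# Print every numeric value assigned to KEY in FILE, joined by commas.
# The match is anchored at the start of the line, so `limit` does not
# also match keys such as `output_limit`.
setting_values() {
  local key="$1" file="$2"
  sed -n -E "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*([0-9]+).*/\1/p" "$file" \
    | paste -sd, -
}

# Succeed only when the values found for KEY in FILE equal EXPECTED.
check_setting() {
  local key="$1" expected="$2" file="$3" actual
  actual="$(setting_values "$key" "$file")"
  if [ "$actual" = "$expected" ]; then
    echo "OK: ${key} = ${actual}"
  else
    echo "MISMATCH: ${key} expected ${expected}, found ${actual:-nothing}" >&2
    return 1
  fi
}
```

For this change the expected calls would be check_setting concurrent 900, check_setting limit 900 and check_setting IdleCount 800,100,800, each followed by the path to config.toml. Reading /etc/gitlab-runner/config.toml is assumed to need root, as the sudo in the grep commands suggests.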
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5
- Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5436
- Run apply_to_prod on merge pipeline
- Force chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i chef-client'
Monitoring
Key metrics to observe
- Metric: API Quotas
- Location: https://console.cloud.google.com/apis/api/compute.googleapis.com/quotas?project=gitlab-ci-plan-free-7-7fe256
- What changes to this metric should prompt a rollback: Hitting rate limits for Heavy-weight read requests and Read requests constantly (once every few hours is fine)
- Metric: CPU usage of runner manager
- Location: https://dashboards.gitlab.net/d/ci-runners-incident-runner-manager/ci-runners-incident-support-runner-manager?viewPanel=17&orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=All&var-runner_manager=shared-runners-manager-7.gitlab.com.*&var-jobs_running_for_project=0&var-runner_job_failure_reason=All
- What changes to this metric should prompt a rollback: Saturation on CPU
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Steve Xuereb