# Increase `concurrent` and `limit` to 1100 for srm6 and srm7

## Production Change

### Change Summary

Increase the `concurrent` and `limit` settings from 900 to 1100 for shared-runners-manager-6 and shared-runners-manager-7. This is part of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13277.
### Change Details

- Services Impacted - ~"Service::CI Runners"
- Change Technician - @tmaczukin
- Change Criticality - C3
- Change Type - ~change::scheduled
- Change Reviewer - @steveazz
- Due Date -
- Time tracking - 20 minutes
- Downtime Component - 0 minutes
### Detailed steps for the change

#### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 5

- [ ] Set label ~change::in-progress on this issue
- [ ] Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5460
- [ ] Run the `apply_to_prod` job
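Every execution and verification step in this issue targets the same node set, so the knife search can be kept in one place. A convenience sketch (the `SRM_QUERY` variable and `run_on_srms` helper are assumptions for illustration, not part of the official steps):

```shell
# The knife search used throughout this issue, kept in one variable so every
# step converges and inspects the same node set.
SRM_QUERY='roles:gitlab-runner-srm6 OR roles:gitlab-runner-srm7'

# Hypothetical wrapper around the knife invocations below.
run_on_srms() {
  knife ssh -afqdn "$SRM_QUERY" -- "$*"
}

# Example: run_on_srms 'sudo chef-client'
```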
#### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 5

- [ ] Force `chef-client` on srm6 and srm7: `knife ssh -afqdn 'roles:gitlab-runner-srm6 OR roles:gitlab-runner-srm7' -- 'sudo chef-client'`
#### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5

- [ ] Check `concurrent` value: `knife ssh -afqdn 'roles:gitlab-runner-srm6 OR roles:gitlab-runner-srm7' -- 'sudo grep concurrent /etc/gitlab-runner/config.toml'`
- [ ] Check `limit` value: `knife ssh -afqdn 'roles:gitlab-runner-srm6 OR roles:gitlab-runner-srm7' -- 'sudo grep limit /etc/gitlab-runner/config.toml'`
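The grep output above still has to be eyeballed host by host. A small, hypothetical helper (not part of the official steps) that fails if any reported value differs from the new target of 1100, assuming knife prints one `host key = value` line per node:

```shell
# Read "host key = value" lines on stdin; exit non-zero if any value
# differs from the expected one passed as the first argument.
check_value() {
  awk -v want="$1" '
    /=/ { v = $NF; gsub(/[^0-9]/, "", v); if (v != want) bad = 1 }
    END { exit bad }'
}

# Example with sample output; after the change, every line should show 1100:
printf 'srm6.example concurrent = 1100\nsrm7.example concurrent = 1100\n' \
  | check_value 1100 && echo 'all values OK'
```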
### Rollback

#### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5

- [ ] Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5460
- [ ] Run `deploy_to_prod`
- [ ] Force `chef-client` on srm6 and srm7: `knife ssh -afqdn 'roles:gitlab-runner-srm6 OR roles:gitlab-runner-srm7' -- 'sudo chef-client'`
- [ ] Check `concurrent` value: `knife ssh -afqdn 'roles:gitlab-runner-srm6 OR roles:gitlab-runner-srm7' -- 'sudo grep concurrent /etc/gitlab-runner/config.toml'`
- [ ] Check `limit` value: `knife ssh -afqdn 'roles:gitlab-runner-srm6 OR roles:gitlab-runner-srm7' -- 'sudo grep limit /etc/gitlab-runner/config.toml'`
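As with the post-change steps, the rollback can be verified mechanically rather than by eye. A standalone sketch (the sample lines mimic the assumed `host key = value` knife output; after a successful rollback every value should read 900 again):

```shell
# Exit 0, and print a confirmation, only if every "key = value" line on
# stdin shows 900 again.
printf 'srm6.example concurrent = 900\nsrm7.example limit = 900\n' \
  | awk '/=/ { v = $NF; gsub(/[^0-9]/, "", v); if (v != "900") bad = 1 }
         END { exit bad }' \
  && echo 'rollback values OK'
```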
### Monitoring

#### Key metrics to observe

- Metric: Quotas srm6
  - Location: https://console.cloud.google.com/apis/api/compute.googleapis.com/quotas?project=gitlab-ci-plan-free-6-f2de7a
  - What changes to this metric should prompt a rollback: Reaching quotas on `Heavy-weight read requests` and `Read requests`
- Metric: Quotas srm7
  - Location: https://console.cloud.google.com/apis/api/compute.googleapis.com/quotas?project=gitlab-ci-plan-free-7-7fe256
  - What changes to this metric should prompt a rollback: Reaching quotas on `Heavy-weight read requests` and `Read requests`
- Metric: CPU Usage
  - Location: https://dashboards.gitlab.net/d/ci-runners-incident-runner-manager/ci-runners-incident-support-runner-manager?viewPanel=17&orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=All&var-runner_manager=shared-runners-manager-6.gitlab.com.&var-runner_manager=shared-runners-manager-7.gitlab.com.&var-jobs_running_for_project=0&var-runner_job_failure_reason=All
  - What changes to this metric should prompt a rollback: Saturation of the CPU
### Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

#### Summary of the above
### Changes checklist

- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. ~change::unscheduled, ~change::scheduled) based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.