# Increase limit on srmX to have Idle machines when reaching saturation

Production Change

## Change Summary
Increase the `limit` to be higher than `concurrent` in srmX, so that when we reach saturation on the number of jobs we run, we still have Idle machines waiting to be assigned new jobs.
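As a rough sketch of the relationship (the actual values for srmX live in chef-repo and are not shown in this issue; 600 and 700 below are hypothetical):

```shell
# Hypothetical values -- the real concurrent/limit for srmX are set in chef-repo.
# `concurrent` caps how many jobs the runner manager runs at once;
# `limit` caps how many autoscaled machines a runner may keep.
concurrent=600
limit=700

# At job saturation, machines beyond `concurrent` can sit Idle awaiting work.
echo "Idle headroom at saturation: $((limit - concurrent))"
```

With `limit` above `concurrent`, hitting the job cap no longer drains the Idle pool to zero.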
## Change Details
- Services Impacted - Service::CI Runners
- Change Technician - @steveazz
- Change Criticality - C4
- Change Type - changescheduled
- Change Reviewer - @tmaczukin
- Due Date - 2021-05-11
- Time tracking - 20 minutes
- Downtime Component - None
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 5

- [ ] Set label `change::in-progress` on this issue
- [ ] Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5517
- [ ] Run `apply_to_prod`
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 5

- [ ] Run chef-client on all nodes: `knife ssh -a fqdn 'roles:gitlab-runner-srm' -- 'sudo -i chef-client'`
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 2

- [ ] Validate that `limit` has been updated: `knife ssh -a fqdn 'roles:gitlab-runner-srm' -- 'sudo -i grep limit /etc/gitlab-runner/config.toml'`
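For illustration, here is a local sketch of what the validation grep inspects; the file contents and values below are assumed examples, not the real srmX configuration:

```shell
# Build a sample config.toml with assumed values (not the real srmX settings).
cat > /tmp/sample-config.toml <<'EOF'
concurrent = 600
check_interval = 3

[[runners]]
  name = "example-runner"
  limit = 700
EOF

# The post-change check greps for `limit`; expect one line per runner entry.
grep limit /tmp/sample-config.toml
```

Across the fleet, each node should report the new `limit` value; any node still showing the old value has not converged yet and needs another chef-client run.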
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5

- [ ] Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5517
- [ ] Run `apply_to_prod`
- [ ] Run chef-client on all nodes: `knife ssh -a fqdn 'roles:gitlab-runner-srm' -- 'sudo -i chef-client'`
- [ ] Validate that `limit` has been reverted: `knife ssh -a fqdn 'roles:gitlab-runner-srm' -- 'sudo -i grep limit /etc/gitlab-runner/config.toml'`
## Monitoring

### Key metrics to observe
- Metric: GCP Quotas (gitlab-ci-plan-free-X projects)
  - Location: https://console.cloud.google.com/apis/api/compute.googleapis.com/quotas?project=gitlab-ci-plan-free-4-3ba81e&pageState=(%22duration%22:(%22groupValue%22:%22PT6H%22,%22customValue%22:null))
  - What changes to this metric should prompt a rollback: Constantly hitting rate limits for `Heavy-weight read requests` and `Read requests`
- Metric: ci-runners apdex
  - Location: https://dashboards.gitlab.net/d/ci-runners-main/ci-runners-overview?viewPanel=79474957&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
  - What changes to this metric should prompt a rollback: Drop in apdex
- Metric: Autoscaling machines
  - Location: https://dashboards.gitlab.net/d/ci-runners-incident-autoscaling/ci-runners-incident-support-autoscaling?viewPanel=14&orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=shared&var-runner_manager=All&var-jobs_running_for_project=0&var-gcp_exporter=shared-runners-manager-3.gitlab.com:9393&var-gcp_project=All&var-gcp_region=All
  - What changes to this metric should prompt a rollback: Drop in machines in the `used` state
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
## Changes checklist

- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.