Increase concurrent to 1200 for srmX

Production Change

Change Summary

Increase concurrent from 1100 to 1200 for srmX, so that we can run an extra 500 jobs concurrently across the srmX runner managers.
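The arithmetic behind the summary can be sketched as below. The count of five runner managers is an assumption inferred from the figures (500 extra jobs at 100 extra per manager); the issue itself does not state how many srmX managers exist, and `extra_capacity` is a hypothetical helper name.

```shell
# extra_capacity OLD NEW MANAGERS
# Prints the total number of additional concurrent jobs gained when each
# of MANAGERS runner managers raises its `concurrent` limit from OLD to NEW.
extra_capacity() {
  local old="$1" new="$2" managers="$3"
  echo $(( (new - old) * managers ))
}

# Assumption: 5 srmX runner managers (not stated in the issue).
extra_capacity 1100 1200 5
```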

Change Details

Services Impacted - Service::CI Runners
Change Technician - @steveazz
Change Criticality - C4
Change Type - change::scheduled
Change Reviewer - @tmaczukin
Due Date - 2021-05-12
Time tracking - 20 minutes
Downtime Component - None

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 0

Set label change::in-progress on this issue

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 5

Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5518
Run apply_to_prod job
Run chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm' -- 'sudo -i chef-client'

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5

Check concurrent value: knife ssh -afqdn 'roles:gitlab-runner-srm' -- 'sudo -i grep concurrent /etc/gitlab-runner/config.toml'
- Expected value: 1200
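The verification step above could be automated with a small filter over the command's output. This is a minimal sketch, not part of the original change plan: `check_concurrent` is a hypothetical helper, and it assumes each output line contains the setting in the form `concurrent = <number>`, optionally prefixed by the hostname that knife ssh prints.

```shell
# check_concurrent EXPECTED
# Reads the grep output on stdin and succeeds only if every line that
# sets `concurrent` uses the expected value.
# Exit codes: 0 = all match, 1 = at least one mismatch, 2 = nothing found.
check_concurrent() {
  local expected="$1"
  awk -v expected="$expected" '
    # Match "concurrent = <number>" anywhere on the line, so an optional
    # hostname prefix is ignored.
    match($0, /concurrent[ \t]*=[ \t]*[0-9]+/) {
      value = substr($0, RSTART, RLENGTH)
      sub(/^[^0-9]*/, "", value)   # keep only the digits
      seen++
      if (value != expected) {
        printf "MISMATCH: %s\n", $0
        bad++
      }
    }
    END {
      if (seen == 0) { print "no concurrent setting found"; exit 2 }
      if (bad > 0) exit 1
      printf "OK: %d host(s) report concurrent = %s\n", seen, expected
    }
  '
}
```

To use it, pipe the output of the knife ssh check command into `check_concurrent 1200`; a non-zero exit status means at least one runner manager has not picked up the new value.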

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 10

Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5518
Run apply_to_prod job
Run chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm' -- 'sudo -i chef-client'

Monitoring

Key metrics to observe

Metric: Apdex
- Location: https://dashboards.gitlab.net/d/ci-runners-main/ci-runners-overview?viewPanel=79474957&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- What changes to this metric should prompt a rollback: Drop in apdex
Metric: GCP Quotas
- Location:
- What changes to this metric should prompt a rollback: Hitting rate limits
Metric: CPU usage
- Location: https://dashboards.gitlab.net/d/ci-runners-incident-runner-manager/ci-runners-incident-support-runner-manager?viewPanel=17&orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=shared&var-runner_manager=All&var-jobs_running_for_project=0&var-runner_job_failure_reason=All
- What changes to this metric should prompt a rollback: Reaching CPU saturation

Summary of infrastructure changes

~~Does this change introduce new compute instances?~~
~~Does this change re-size any existing compute instances?~~
~~Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?~~

Changes checklist

This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
This issue has the change technician as the assignee.
Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
Necessary approvals have been completed based on the Change Management Workflow.
Change has been tested in staging and results noted in a comment on this issue.
A dry-run has been conducted and results noted in a comment on this issue.
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
There are currently no active incidents.

Edited May 12, 2021 by Steve Xuereb
