Update shared-runner-manager-4 to create ephemeral VMs inside of gitlab-ci-plan-free-4
Production Change
Change Summary
As part of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13277, we need to migrate shared-runners-manager-4
to create ephemeral VMs for jobs inside of gitlab-ci-plan-free-4
so that we can vertically scale the number of machines that srm4
can handle.
Change Details
- Services Impacted - ServiceCI Runners
- Change Technician - @steveazz
- Change Criticality - C2
- Change Type - changescheduled
- Change Reviewer - @igorwwwwwwwwwwwwwwwwwwww
- Due Date - 2020-05-10 11:00 UTC
- Time tracking - 3 hours
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 120
-
As a GitLab admin, go to the /admin/runners page and pause shared-runner-manager-4.gitlab.com. This prevents any accidental pickup of jobs while doing the change.
-
Disable chef-client for srm4: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo -i chef-client-disable "change-management: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4546"'
-
Stop shared-runner-manager-4.gitlab.com:
# On your computer
$ ssh lb-bastion.ci.gitlab.com
# Start tmux session
$ tmux
# Drain srm4
$ knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo /root/runner_upgrade.sh stop'
# Delete all idle machines
$ knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo ls /root/.docker/machine/machines | xargs -P100 -n1 sudo -H docker-machine rm -f'
# Turn off VM
$ knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo poweroff'
This will take around 120 minutes since we have to wait for the node to drain
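Before running the "Delete all idle machines" step, it may help to preview which machines would be removed. The following is a minimal sketch and not part of the original procedure: the function name `dry_run_machine_cleanup` is hypothetical, and it assumes GNU xargs (for `-r`). It prints the `docker-machine rm -f` commands, one per machine directory, instead of executing them.

```shell
# Dry run of the idle-machine cleanup: print one docker-machine command per
# machine directory instead of executing it.
# NOTE: dry_run_machine_cleanup is a hypothetical helper, not an existing script.
dry_run_machine_cleanup() {
  local machines_dir="$1"
  # -n1 passes one machine name per command, as in the real step.
  # -r (GNU xargs) skips the command entirely when there are no machines.
  ls "$machines_dir" | xargs -r -n1 echo sudo -H docker-machine rm -f
}
```

On the runner manager this could be called as `sudo bash -c '...; dry_run_machine_cleanup /root/.docker/machine/machines'`; once the output looks right, the real step from the list above does the deletion.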
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 20
-
Change the service account attached to the shared-runners-manager-4 machine. The service account should be gitlab-ci-plan-free-4@gitlab-ci-xxxxx.iam.gserviceaccount.com
-
Merge and deploy the configuration changes: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5454
-
Start shared-runners-manager-4 inside of the GCP console
-
Reinstall the runner (this might start the service, but that's OK because the runner manager is paused): knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo -i apt-get install --reinstall gitlab-runner=13.11.0-rc1'
-
Enable chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo -i chef-client-enable'
-
Run chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo -i chef-client'
-
Disable chef-client to prevent it from running again: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo -i chef-client-disable "change-management: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4546"'
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 30
-
Validate srm4 config.toml has the expected values:
- Pointing to new GCP project
  - Command: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-project=[^,]*"'
  - Expected Value: "google-project=gitlab-ci-plan-free-4-xxxx"
- Google service account specified for new VMs
  - Command: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-service-account=[^,]*"'
  - Expected Value: "google-service-account=ephemeral-runner@gitlab-ci-plan-free-4-xxxx.iam.gserviceaccount.com"
- VPC network defined
  - Command: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-network=[^,]*"'
  - Expected Value: "google-network=ephemeral-runners"
- Subnetwork defined
  - Command: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-subnetwork=[^,]*"'
  - Expected Value: "google-subnetwork=ephemeral-runners"
-
Validate docker-machine version on srm4:
- knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo -i docker-machine --version': Expected 0.16.2-gitlab.12
- knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sha256sum /usr/bin/docker-machine': Expected 0c9c659318fabe54cff460b6bb1d92f2a987dffd7de03d6ccc17d6449d4871f9
-
Make sure GitLab Runner is running: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo -i gitlab-runner start && gitlab-runner status'
-
Unpause shared-runner-manager-4.gitlab.com on the /admin/runners GitLab.com page
-
Make sure that the picked-up job runs successfully:
- Get the job link: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo -i curl http://127.0.0.1:9402/debug/jobs/list'. You will need an admin to see the job since it might be a private project.
- Validate that the machine is created inside of the gitlab-ci-plan-free-4 GCP project and that the name of the machine matches the name inside of the job trace
- Validate that the machine has the correct labels, which are used for billing:
  - runner_manager_group:shared-runners-manager
  - runner_manager_name:shared-runners-manager-4
- When the job has finished, the machine is deleted
- Check that we still have 10 idle machines
-
Enable chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo -i chef-client-enable'
-
Run chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm' -- 'sudo -i chef-client'
-
Update the concurrent and IdleCount to 200: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5469
-
Update the concurrent and IdleCount to 600: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5471
-
Update the concurrent and IdleCount to 950: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5472 / https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5473
-
Update the concurrent and IdleCount to 1100: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5475
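The config.toml and docker-machine checks in the post-change steps could be wrapped in small helpers so that a mismatch fails loudly instead of depending on reading grep output by eye. The following is a minimal sketch and not part of the original procedure. The function names `machine_option`, `check_machine_option`, and `check_sha256` are hypothetical. It assumes GNU grep (for `-P`), GNU sha256sum, and that each machine option entry in config.toml is followed by a comma, as the grep pattern in the steps above implies.

```shell
# Print the value of a machine option such as google-project from a runner
# config file. Uses the same grep pattern as the validation steps, then strips
# the leading quote, the option name, and the trailing quote.
# Prints nothing when the option is not present.
machine_option() {
  local config_file="$1" option="$2"
  grep -Po "\"${option}=[^,]*" "$config_file" \
    | head -n1 \
    | sed -e "s/^\"${option}=//" -e 's/"$//'
}

# Succeed only when the option is present with exactly the expected value.
check_machine_option() {
  local config_file="$1" option="$2" expected="$3" actual
  actual="$(machine_option "$config_file" "$option")"
  if [ "$actual" = "$expected" ]; then
    echo "OK   ${option}=${actual}"
  else
    echo "FAIL ${option}: expected '${expected}', got '${actual}'"
    return 1
  fi
}

# Succeed only when the file's SHA-256 digest matches the expected hex string.
check_sha256() {
  local file="$1" expected="$2"
  echo "${expected}  ${file}" | sha256sum --check --status -
}
```

On the runner manager, `check_machine_option /etc/gitlab-runner/config.toml google-network ephemeral-runners` and `check_sha256 /usr/bin/docker-machine 0c9c659318fabe54cff460b6bb1d92f2a987dffd7de03d6ccc17d6449d4871f9` would cover two of the checks listed above. For the rollback checks where no value is expected, an empty result from `machine_option` indicates the option is absent.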
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 130
-
As a GitLab admin, go to the /admin/runners page and pause shared-runner-manager-4.gitlab.com. This prevents any accidental pickup of jobs while doing the change.
-
Stop shared-runner-manager-4.gitlab.com:
# On your computer
$ ssh lb-bastion.ci.gitlab.com
# Start tmux session
$ tmux
# Drain srm4
$ knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo /root/runner_upgrade.sh stop'
# Delete all idle machines
$ knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo ls /root/.docker/machine/machines | xargs -P100 -n1 sudo -H docker-machine rm -f'
# Turn off VM
$ knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo /root/runner_upgrade.sh stop_and_poweroff'
-
Change the service account attached to the shared-runners-manager-4 machine. The service account should be the default 62632269664-compute@developer.gserviceaccount.com account.
-
Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5454
- Merge
- Run apply_to_prod
-
Start the machine
-
Enable chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo -i chef-client-enable'
-
Run chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo -i chef-client'
-
Validate that the old working config.toml is there:
- Pointing to old GCP project
  - Command: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-project=[^,]*"'
  - Expected Value: "google-project=gitlab-ci-xxx"
- Google service account not specified for new VMs
  - Command: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-service-account=[^,]*"'
  - Expected Value:
- VPC network defined
  - Command: knife ssh -afqdn 'roles:gitlab-runner-srm4' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-network=[^,]*"'
  - Expected Value:
-
Make sure GitLab Runner is running: gitlab-runner start && gitlab-runner status
-
Unpause shared-runner-manager-4.gitlab.com in the /admin/runners GitLab.com UI
Monitoring
Key metrics to observe
- Metric: ci-runner apdex score
- Location: https://dashboards.gitlab.net/d/ci-runners-main/ci-runners-overview?viewPanel=79474957&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
- What changes to this metric should prompt a rollback: A drop in the apdex score
- Metric: GCP API Quotas
- Location: https://console.cloud.google.com/apis/api/compute.googleapis.com/quotas?project=gitlab-ci-plan-free-4-3ba81e
- What changes to this metric should prompt a rollback: Hitting API rate limits
- Metric: Jobs running by shared-runner-manager-4
- Location: https://dashboards.gitlab.net/d/ci-runners-deployment/ci-runners-deployment-overview?viewPanel=8&orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=All&var-runner_manager=shared-runners-manager-4.gitlab.com.*&var-runner_job_failure_reason=All
- What changes to this metric should prompt a rollback: Not picking up jobs
- Metric: System failures for shared-runner-manager-4
- Location: https://dashboards.gitlab.net/d/ci-runners-deployment/ci-runners-deployment-overview?viewPanel=9&orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=All&var-runner_manager=shared-runners-manager-4.gitlab.com.*&var-runner_job_failure_reason=runner_system_failure
- What changes to this metric should prompt a rollback: Large sustained spike
- Logs: Error level logs for srm4
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue. #4028 (comment 537496385)
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.