Update shared-runner-manager-7 to create ephemeral VMs inside of gitlab-ci-plan-free-7
Production Change
Change Summary
In https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12863 we created a new project, gitlab-ci-plan-free-7, which we will use to create the ephemeral VMs for jobs picked up by shared-runner-manager-7.gitlab.com. We need to update the configuration for srm7 to point to the new project.
Change Details
- Services Impacted - Service::CI Runners
- Change Technician - @steveazz @igorwwwwwwwwwwwwwwwwwwww @tmaczukin
- Change Criticality - C2
- Change Type - change::unscheduled
- Change Reviewer - @igorwwwwwwwwwwwwwwwwwwww
- Due Date - 2020-03-29 09:00 UTC
- Time tracking - 3 hours
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 120
- Wrap up #4061 (closed)
- As a GitLab admin, go to the /admin/runners page and pause shared-runner-manager-7.gitlab.com. This prevents any accidental pickup of jobs during the change.
- Disable chef-client for srm7: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i chef-client-disable "change-management: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4028"'
- Stop shared-runner-manager-7.gitlab.com:
    # On your computer
    $ ssh shared-runners-manager-7.gitlab.com
    # Start a tmux session
    $ tmux
    # On srm7
    $ sudo /root/runner_upgrade.sh stop_and_poweroff
  This will take around 120 minutes, since we have to wait for the node to drain.
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 20
- Change the service account attached to the shared-runners-manager-7 machine. The service account should be srm7-gitlab-ci-plan-free-7@gitlab-ci-xxx.iam.gserviceaccount.com
- Merge and deploy the configuration changes: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5248
  - Merge
  - Run the apply_to_prod job
- Start shared-runners-manager-7 inside of the GCP console
- Reinstall the runner: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i apt-get install --reinstall gitlab-runner=13.9.0-rc2'
  This might start the service, but that's OK because the runner manager is paused.
- Enable chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i chef-client-enable'
- Run chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i chef-client'
- Disable chef-client to prevent it from running again: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i chef-client-disable "change-management: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4028"'
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 30
- Validate that the srm7 config.toml has the expected values
  - Pointing to the new GCP project
    - Command: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-project=[^,]*"'
    - Expected Value: "google-project=gitlab-ci-plan-free-7-xxxx
  - Google service account specified for new VMs
    - Command: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-service-account=[^,]*"'
    - Expected Value: "google-service-account=ephemeral-runner@gitlab-ci-plan-free-7-xxxx.iam.gserviceaccount.com"
  - VPC network defined
    - Command: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-network=[^,]*"'
    - Expected Value: "google-network=ephemeral-runners"
- Make sure GitLab Runner is running: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i gitlab-runner start && gitlab-runner status'
- Unpause shared-runner-manager-7.gitlab.com on the GitLab.com /admin/runners page
- Make sure that the picked-up job runs successfully
  - Get the job link: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i curl http://127.0.0.1:9402/debug/jobs/list'
    You will need an admin to see the job, since it might belong to a private project.
  - Validate that the machine is created inside of the gitlab-ci-plan-free-7 GCP project and that the name of the machine matches the name in the job trace
  - Validate that the machine has the correct labels, which are used for billing
    - runner_manager_group:shared-runners-manager
    - runner_manager_name:shared-runners-manager-7
  - When the job has finished, the machine is deleted
  - Check that we still have 10 idle machines
- Enable chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm' -- 'sudo -i chef-client-enable'
- Run chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm' -- 'sudo -i chef-client'
- Update concurrent and IdleCount to 200
  - Merge and apply_to_prod https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5265
  - Bake for 1 hour
  - Monitor: #4028 (comment 540076614)
- Update concurrent and IdleCount to 400
  - Merge and apply_to_prod https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5268
  - Bake for 1 hour
  - Monitor: #4028 (comment 540204419)
- Update concurrent and IdleCount to 650
  - Merge and apply_to_prod https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5269
  - Bake for 24 hours
  - Monitor: #4028 (comment 540652021)
- Update concurrent and IdleCount to 750
  - Merge and apply_to_prod https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5274
  - Bake for 24 hours
  - Monitor
- Update concurrent and idle machines back to the original value of 800
  - Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5255
  - Run apply_to_prod
  - Enable chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i chef-client-enable'
  - Run chef-client: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo -i chef-client'
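The staged ramp-up at the end of the Post-Change steps can be summarized as a small schedule. This is only a sketch: the stage values and bake times come from the steps above, while the `stages` variable and its `value:hours` encoding are illustrative assumptions and not part of any runner tooling. It prints the stages in order and adds up the bake time spent before the final return to 800.

```shell
#!/usr/bin/env bash
# Illustrative only: "<concurrent/IdleCount value>:<bake hours>" pairs
# copied from the ramp-up steps; the encoding itself is an assumption.
stages="200:1 400:1 650:24 750:24"

total=0
for stage in $stages; do
  value="${stage%%:*}"   # text before the colon
  hours="${stage##*:}"   # text after the colon
  echo "Set concurrent and IdleCount to ${value}, then bake for ${hours}h"
  total=$((total + hours))
done

echo "Total bake time before returning to 800: ${total}h"
```

With these numbers the bake periods alone add up to 50 hours, not counting the time needed to merge and deploy each step.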
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 130
- As a GitLab admin, go to the /admin/runners page and pause shared-runner-manager-7.gitlab.com. This prevents any accidental pickup of jobs during the change.
- Stop shared-runner-manager-7.gitlab.com:
    # On your computer
    $ ssh shared-runners-manager-7.gitlab.com
    # Start tmux
    $ tmux
    # On srm7
    $ sudo /root/runner_upgrade.sh stop_and_poweroff
- Change the service account attached to the shared-runners-manager-7 machine. The service account should be the default compute@developer.gserviceaccount.com account.
- Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5248
  - Merge
  - Run apply_to_prod
- Start the machine
- Inside of the shared-runner-manager-7.gitlab.com machine, reinstall gitlab-runner and run sudo /root/runner_upgrade.sh update:
    # On your computer
    $ ssh shared-runners-manager-7.gitlab.com
    # On srm7: reinstall gitlab-runner
    $ sudo apt-get install --reinstall gitlab-runner=13.9.0-rc2
    # On srm7: run the update
    $ sudo /root/runner_upgrade.sh update
- Validate that the old working config.toml is there
  - Pointing to the old GCP project
    - Command: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-project=[^,]*"'
    - Expected Value: "google-project=gitlab-ci-xxx
  - Google service account not specified for new VMs
    - Command: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-service-account=[^,]*"'
    - Expected Value:
  - VPC network defined
    - Command: knife ssh -afqdn 'roles:gitlab-runner-srm7' -- 'sudo cat /etc/gitlab-runner/config.toml | grep -Po "\"google-network=[^,]*"'
    - Expected Value:
- Make sure GitLab Runner is running: gitlab-runner start && gitlab-runner status
- Unpause shared-runner-manager-7.gitlab.com in the GitLab.com /admin/runners UI
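The config.toml validations in the Post-Change and Rollback steps all come down to checking whether a quoted "option=value" entry is present or absent. A minimal sketch of that check follows, assuming the options appear as quoted entries, as the grep patterns in those steps suggest. The helper names `has_option` and `option_absent` and the sample file contents are hypothetical; on srm7 the file to inspect would be /etc/gitlab-runner/config.toml.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of gitlab-runner or the runbook tooling.

# Succeeds if the file has a quoted entry starting with "<option>=<prefix>".
# Usage: has_option FILE OPTION PREFIX
has_option() {
  grep -Fq -- "\"$2=$3" "$1"
}

# Succeeds if the file has no quoted "<option>=..." entry at all.
# Usage: option_absent FILE OPTION
option_absent() {
  ! grep -Fq -- "\"$2=" "$1"
}

# Sample data standing in for config.toml; every value is a placeholder.
sample="$(mktemp)"
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
[runners.machine]
  MachineOptions = [
    "google-project=example-project-1234",
    "google-zone=example-zone",
  ]
EOF

has_option "$sample" google-project example-project \
  && echo "google-project: expected prefix found"
option_absent "$sample" google-service-account \
  && echo "google-service-account: not set"
option_absent "$sample" google-network \
  && echo "google-network: not set"
```

Matching on a prefix keeps the check usable when only the start of the project name is known, and relying on the exit status avoids comparing grep output by eye.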
Monitoring
Key metrics to observe
- Metric: ci-runner apdex score
  - Location: https://dashboards.gitlab.net/d/ci-runners-main/ci-runners-overview?viewPanel=79474957&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
  - What changes to this metric should prompt a rollback: A drop in the apdex score
- GCP console for the gitlab-ci-plan-free-7 project: Quotas
  - Location: GCP console, search for quotas
- Metric: Jobs running by shared-runner-manager-7
  - Location: https://dashboards.gitlab.net/d/ci-runners-deployment/ci-runners-deployment-overview?viewPanel=8&orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=All&var-runner_manager=shared-runners-manager-7.gitlab.com.*&var-runner_job_failure_reason=All
  - What changes to this metric should prompt a rollback: Not picking up jobs
- Metric: System failures for shared-runner-manager-7
  - Location: https://dashboards.gitlab.net/d/ci-runners-deployment/ci-runners-deployment-overview?viewPanel=9&orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=All&var-runner_manager=shared-runners-manager-7.gitlab.com.*&var-runner_job_failure_reason=runner_system_failure
  - What changes to this metric should prompt a rollback: Large sustained spike
- Logs: Error level logs for srm7
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue. #4028 (comment 537496385)
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.