Switch runners to use ci-gateway ILB

Production Change

Change Summary

As part of https://gitlab.com/groups/gitlab-org/-/epics/7212 we're introducing a new networking setup for Service::CI Runners and the jobs they execute. This configuration moves Runner API communication and the Git operations performed for executed jobs off the public Internet and onto private networking within GCP.

The plan was discussed in detail at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14874. That issue also contains the results of our tests on staging.gitlab.com.

This change will guide us through the steps needed to move the GitLab.com workloads executed on Service::CI Runners to a similar setup.

Change Details

Services Impacted - Service::CI Runners
Change Technician - @tmaczukin
Change Reviewer - @ahmadsherif
Time tracking - 90 min
Downtime Component - No downtime is expected

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 10

Merge and apply terraform changes from https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3483 (VPC peerings)
Merge and apply terraform changes from https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3484 (firewall updates)
Execute manual tests on one of the private, shared-gitlab-org and shared nodes against the GitLab.com ILB FQDNs to make sure that we have connectivity
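
The manual connectivity test in the last step could be sketched roughly as below. This is only an illustration, not the procedure that was actually used: it assumes that a plain TCP connection to port 443 is a sufficient signal, and the function name `can_connect` is hypothetical. The ILB FQDNs are not listed in this issue, so they have to be supplied by the person running the check.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Assumption: reaching the TCP port is treated as "we have connectivity".
    This does not verify TLS, the Runner API, or Git operations.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Run from one node of each shard, a call such as `can_connect(fqdn)` for every ILB FQDN should return `True` before moving on to Day 1.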

Day 1 - `private` runners shard

Estimated Time to Complete (mins) - 20

Set label change::in-progress on this issue
Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1200+
Wait for changes to be applied through CI
Wait for changes to be applied by chef-client on the active private nodes (or force it by manual run of chef-client)
Run test pipeline in https://gitlab.com/gitlab-org/test-git.ci-gateway-usage/-/pipelines/ to verify that jobs are still working properly
Set label change::scheduled

Day 2 - `shared-gitlab-org` runners shard

Estimated Time to Complete (mins) - 20

Set label change::in-progress on this issue
Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1201+
Wait for changes to be applied through CI
Wait for changes to be applied by chef-client on the active shared-gitlab-org nodes (or force it by manual run of chef-client)
Run test pipeline in https://gitlab.com/tmaczukin-test-projects/test-git.ci-gateway-usage/-/pipelines/ to verify that jobs are still working properly - check the jobs with shared-gitlab-org in the name
Set label change::scheduled

Day 3 - `shared` runners shard

Estimated Time to Complete (mins) - 20

Set label change::in-progress on this issue
Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1202+
Wait for changes to be applied through CI
Wait for changes to be applied by chef-client on the active shared nodes (or force it by manual run of chef-client)
Run test pipeline in https://gitlab.com/gitlab-org/test-git.ci-gateway-usage/-/pipelines/ to verify that jobs are still working properly - check the jobs with shared in the name

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 20

(Start here after Day 3 was handled) Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1202+
(Start here after Day 2 was handled) Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1201+
(Start here after Day 1 was handled) Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1200+
Wait for changes to be applied through CI
Wait for changes to be applied by chef-client on the active shared, shared-gitlab-org and private nodes (or force it with a manual run of chef-client; use the deployment dashboard to check which color of which shard is currently "active")
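
The "start here" entry points above mean that the reverts run newest first and stop at Day 1. A minimal sketch of that ordering is shown below. The merge request numbers are the ones listed in the rollback steps; the names `ROLLOUT` and `reverts_needed` are hypothetical and exist only for this illustration.

```python
# (day, shard, chef-repo merge request) in rollout order,
# using the MR numbers from the rollback steps.
ROLLOUT = [
    (1, "private", 1200),
    (2, "shared-gitlab-org", 1201),
    (3, "shared", 1202),
]


def reverts_needed(last_completed_day: int) -> list[int]:
    """Return the chef-repo MR numbers to revert, newest first.

    `last_completed_day` is the last rollout day that was handled
    (0 means nothing was merged yet, so nothing needs reverting).
    """
    if not 0 <= last_completed_day <= len(ROLLOUT):
        raise ValueError(f"unknown rollout day: {last_completed_day}")
    merged = [mr for day, _shard, mr in ROLLOUT if day <= last_completed_day]
    return list(reversed(merged))
```

For example, `reverts_needed(2)` returns `[1201, 1200]`: if the problem shows up after Day 2, the Day 3 merge request was never merged and is skipped.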

Monitoring

Key metrics to observe

Metric: Failures on GitLab Inc. runners (by instance,failure_reason)
- Location: https://dashboards.gitlab.net/d/ci-runners-incident-runner-manager/ci-runners-incident-support-runner-manager?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=shared&var-runner_manager=All&var-jobs_running_for_project=0&var-runner_job_failure_reason=All&from=now-1h&to=now&viewPanel=23
- What changes to this metric should prompt a rollback: A spike in job failures on the affected shard at the moment the change is applied. If the failure rate rises significantly and stays elevated, something may be wrong. Checking the logs and inspecting a sample of jobs will be needed to confirm whether the problem is caused by our change.
Metric: all
- Location: https://dashboards.gitlab.net/d/frontend-git-haproxy/frontend-ci-gateway-git-utilisation-based-on-haproxy
- What changes to this metric should prompt a rollback: Hard to say. This is a new dashboard that we created to analyze the new internal traffic, so seeing no change on it after applying the configuration changes would probably signal that something is not right.
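
For the job failures metric, the rollback signal described above is a rise that persists rather than a short spike. One way to express that distinction is sketched below. The function name `sustained_rise`, the `factor` of 2.0 and the `window` of 3 samples are all assumptions made for illustration; they are not thresholds agreed on in this change, and the final decision still depends on checking logs and jobs.

```python
from typing import Sequence


def sustained_rise(
    failures: Sequence[float],
    baseline: float,
    factor: float = 2.0,  # assumed multiplier over the pre-change baseline
    window: int = 3,      # assumed number of most recent samples to inspect
) -> bool:
    """Return True if each of the last `window` samples exceeds baseline * factor.

    A single spike followed by a return to normal yields False;
    a rise that stays elevated yields True.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(failures) < window:
        return False  # not enough data to call it sustained
    threshold = baseline * factor
    return all(value > threshold for value in failures[-window:])
```

With a baseline of 1 failure per interval, the series `[1, 1, 9, 1, 1]` is a transient spike and returns `False`, while `[1, 1, 5, 6, 7]` stays elevated and returns `True`.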

Summary of infrastructure changes

~~Does this change introduce new compute instances?~~
~~Does this change re-size any existing compute instances?~~
~~Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?~~

Change Reviewer checklist

C4 C3 C2 C1:

The scheduled day and time of execution of the change is appropriate.
The change plan is technically accurate.
The change plan includes estimated timing values based on previous testing.
The change plan includes a viable rollback plan.
The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
The change plan includes success measures for all steps/milestones during the execution.
The change adequately minimizes risk within the environment/service.
The performance implications of executing the change are well-understood and documented.
The specified metrics/monitoring dashboards provide sufficient visibility for the change. - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

Edited Mar 11, 2022 by Devin Sylva
