Deliver new scalable and maintainable runners infrastructure for gitlab-qa
DRI: @kwanyangu

In https://gitlab.com/gitlab-org/gitlab-qa/-/issues/261 it was decided to move ownership and maintenance of the runner manager from the Distribution team to the Quality team, since it is used for QA pipelines only.

To enable that, we need to:

1. Move the runners used for `gitlab-qa` runs as part of GitLab development to a dedicated project in GCP, `gitlab-qa-runners`.
2. Create a VM in this project, with `gitlab-runner` and `docker-machine` installed and managed by `chef-repo`, similar to [Distribution team's runners](https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/blob/master/roles/build-trigger-runner-manager-gitlab-org.json).

Additional links for context:

- [Ownership of runner manager machine used for QA jobs](https://gitlab.com/gitlab-org/gitlab-qa/-/issues/261)
- [Create a new VM for `gitlab-qa` runner manager to be managed by chef-repo](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/14333)

## Next steps

1. [x] [Define `qa-runners` environment in `config-mgmt` gl-infra terraform project](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24380).
1. [x] Provision `gitlab-qa-runners-2` project using `config-mgmt/environments/env-projects`.
1. [x] Set `create_legacy_service_account = true` attribute on the `gitlab-qa-runners-2` project in `config-mgmt/environments/env-projects`.
1. [x] [Provision a keyring for chef bootstrapping secrets](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24450).
1. [x] [Provision service accounts](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24452) as necessary.
1. [x] [Provision bootstrap secret GCS buckets](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24449).
1. [x] [Provision vpc networking](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24451).
1. [x] [Set up chef bootstrapping secrets](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24449).
1. [x] [Define `qa-runners-base` chef role](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24527).
1. [x] [Define `qa-runners-base-bastion` chef role](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24528).
1. [x] [Define a private service connect to vault.ops.gke.gitlab.net](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24572).
1. [x] [Add new qa-runners stanza to the chef_environments local in the chef module of config-mgmt/environments/vault-production](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24623).
1. [x] [Add chef-environment definitions for vault in config-mgmt](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24604).
1. [x] [Generate and configure env/qa-runners/cookbook/cookbook-gitlab-runner/runners-manager-private secret json blob](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24602).
1. [x] [Generate consul agent client certificate and key and configure them in vault](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24615).
1. [x] [[Stretch goal] Repair bastion host ssh keys configuration](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24659).
1. [x] [Provision service account, role, and user for the runners-manager](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24625).
1. [x] [Provision service account for ephemeral runner instances](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24628).
1. [x] [Provision service account, role, and permissions used by a script which cleans up stale runner instances](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24629).
1. [x] [Define service account to support ephemeral runners usage of GCS for artifact caching](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24639).
1. [x] [Provision a runners-manager instance](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24446).
1. [x] [Validate and repair the chef role configurations for the runners-manager instance](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24249).
1. [x] [Create and configure runners-manager runners token in vault](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24681).
1. [x] [Provision a bastion host](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24447).
1. [x] [[Stretch goal] Repair bastion ssh proxy](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24693).
1. [ ] [Document scaling management procedures and present resources to QA](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24403).
1. [ ] [Stretch goal] Shepherd migration of legacy non-managed runners-manager workloads into the new apparatus and expand the sub-network CIDR range for extra ephemeral runners.
1. [ ] [Stretch goal] Work with the Observability Team to implement thanos, prometheus, grafana, and fluentd configuration for monitoring and logging.

[Previous status recorded as comment](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1095#note_1630082587).

### KR

https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/4083

## Status 2023-11-29

The infrastructure expectations have been completed and a new runner manager is set up in the GCP project `gitlab-qa-runners-2`. This runner manager is managed by `chef-repo`, similar to how the [Distribution](https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/blob/master/roles/build-trigger-runner-manager-gitlab-org.json) runners are managed.

Test Platform will pick up the next steps of deploying runners and track them in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1095.