Draft of Windows Shared Runners infrastructure architecture
We've planned to test the Autoscaler project by introducing Windows Shared Runners on GitLab.com; this is the main reason why the project was created.
WSR infrastructure and deployment
Requirements
- The Runner Manager VM needs to be powerful. For the Linux ones we've been using a custom 12vCPU/16GB RAM machine type for a few days now, to leave ourselves some room for future growth. The concurrency will soon be set to 700 per manager. For Windows I think `concurrent = 400` and 10vCPU/16GB RAM should be the starting point.
- We should install the Windows equivalent of `node_exporter` (I don't remember the name right now, I can find it later) and track the metrics with our Prometheus fleet.
- We should enable metrics on the Runner - it should work on Windows without problems 😉
- We need a graceful shutdown switch for Windows, since `SIGQUIT` doesn't exist there. We could use the web debug server (configured together with the metrics one), with a control API guarded by some configurable token. It could be usable for the Linux ones as well. This must be done as part of the MVC, but remember that for the beta release of WSR we can use a beta version of Runner 😉
- Given the dynamic nature of the deployment that I want to propose, the Runner's and the nodes' metrics should be tracked by the CI Prometheus servers - we will need to add another autodiscovery here (a scrape sketch follows this list). These servers are federated with our main Alertmanager and with the Thanos fleet, so we will have both alerting and access to the metrics in Grafana.
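To make the monitoring requirements above concrete, here is a minimal scrape sketch. The host exporter is presumably `windows_exporter` (default port 9182), the Runner's metrics listener is assumed to be on 9252, and the hostnames follow the naming convention proposed below - all of these are placeholders, not final values:

```yaml
# Illustrative scrape configuration for the CI Prometheus servers.
# Both ports are assumptions: 9182 for the Windows host exporter,
# 9252 for Runner's own metrics listener.
scrape_configs:
  # Runner metrics from the managers.
  - job_name: shared-runners-windows
    static_configs:
      - targets:
          - windows-shared-runners-manager-1.gitlab.com:9252
          - windows-shared-runners-manager-2.gitlab.com:9252

  # Host metrics from the node_exporter equivalent (this job name
  # is a placeholder).
  - job_name: shared-runners-windows-hosts
    static_configs:
      - targets:
          - windows-shared-runners-manager-1.gitlab.com:9182
          - windows-shared-runners-manager-2.gitlab.com:9182
```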
Naming convention
- `windows-shared-runners-manager-X.gitlab.com` for the managers' names.
- `wsrm` for the shortcuts, like the one configured as part of the autoscaled VM names, future tracking of autoscaled VM metrics, etc.
- `shared-runners-windows` as the Prometheus `job`, so it will be consistent with the rest of our metrics and can be added to our graphs just by adjusting some templating variables in Grafana and a few alerting rules (an autodiscovery sketch follows this list).
- `windows` and `windows-[version_here]` as the labels of the Runners in GitLab.
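For the autoscaled VMs, the extra autodiscovery mentioned in the requirements could use Prometheus' built-in GCE service discovery, keeping only instances whose names start with the `wsrm` shortcut. A sketch, with the project ID, zone and exporter port as placeholders:

```yaml
# Sketch of the additional autodiscovery for the autoscaled VMs.
# Project, zone and port are placeholders.
scrape_configs:
  - job_name: shared-runners-windows
    gce_sd_configs:
      - project: group-verify   # placeholder project ID
        zone: us-east1-c        # placeholder zone
        port: 9182              # assumed host exporter port
    relabel_configs:
      # Keep only VMs spawned by our managers, recognizable by the
      # `wsrm` name prefix.
      - source_labels: [__meta_gce_instance_name]
        regex: wsrm-.*
        action: keep
```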
Configuration files
I've shared examples of a working configuration in [issue]. You can find both the Runner and Autoscaler files there. Be careful with the path syntax!
We will also need a service account JSON key file. The service account must have minimal permissions, set to match what can be seen in the `group-verify` GCP project. The ID of the test service account can be found in 1Password.
General architecture
- We have two Runner Managers based on Windows VMs. Both are publicly available for RDP. The internal metrics and control server is accessible only from a limited list of IPs. Let's configure this with the cloud firewall.
- Two Prometheus servers based on Linux VMs. We can re-use the role that configures the currently existing CI Prometheus servers. They will scrape the Runner Managers and later the autoscaled VMs.
- A GCS bucket for the shared cache.
- Autoscaled VMs based on `n1-standard-2` (???) machines, without public IPs but with access to the Internet through Cloud NAT.
- Access from the Runner Managers to the autoscaled VMs :arrow_right: the WinRM port.
Infrastructure management and deployments
Knowing how painful and problematic our current deployment process is, and knowing the differences between the Linux and Windows environments, I've made a few assumptions:
- Deployment should happen fully in CI and not be bound to a developer's machine.
- The full infrastructure (currently: firewall rules, Runner Manager VMs, CI Prometheus VMs) should be managed by terraform.
- Because of the graceful shutdown requirement, we should follow a blue/green-like deployment for the Runner Managers.
- The CI Prometheus servers should be managed by our current roles for CI Prometheus servers.
- Since running Chef on Windows is most likely not possible in our configuration (I remember talking about this in the past with one of our SREs, and the conclusion was: "it will be a hard and long way"), let's use custom-built VM images instead.
The infrastructure's declarative configuration should be kept in a project:
- terraform configuration for the GCP infrastructure,
- packer configuration for the Managers' VM image (a CI sketch follows this list).
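To show how the image building could be wired into CI, here is a hypothetical fragment of the project's `.gitlab-ci.yml`; the `packer/` and `config/` paths, the `wsrm-manager` image name and the job layout are all assumptions:

```yaml
# Hypothetical fragment of the project's .gitlab-ci.yml: build a new
# manager VM image only when the Runner/Autoscaler config or versions
# change. File paths and the image name are assumptions.
build-manager-image:
  stage: build
  image: hashicorp/packer:latest
  rules:
    - changes:
        - packer/**/*
        - config/**/*
  script:
    # Tag the image with the pipeline ID so later jobs can compute
    # the same name when running terraform.
    - packer build -var "image_name=wsrm-manager-${CI_PIPELINE_ID}" packer/manager.json
```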
Process
- When we want to change the configuration of Runner Manager/Autoscaler or deploy a new version, we commit and push the configuration change.
- A CI pipeline is started. If the config files of Runner/Autoscaler or their versions were changed, a new image build is started. It uses packer, builds the image, and tags it with some standardized name that can then be computed by other jobs in the pipeline (the whole flow is sketched after this list).
- Terraform `apply` is executed (with all previous needs, like running `plan` first). This applies changes to the firewall and Prometheus nodes. But if a change of the version/configuration of Runner/Autoscaler is discovered, it should only spin up new VMs.
- If this is not the version/config case, the pipeline is finished after this stage.
- In the other case, a next job is started. It sends a request to the old Runner Managers to start a graceful shutdown. The job waits until the process is terminated (we can track the metrics endpoint for this), but no longer than `2h` (equal to the timeout set on these Runners in GitLab). If the Runner didn't finish during this time, we log this and end the job.
- Another job is started that reconfigures terraform to delete the old Runner Manager nodes. `plan` and `apply` are executed and the nodes are deleted.
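A minimal sketch of the deployment part of such a pipeline, continuing the fragment above. The graceful-shutdown endpoint and the `RUNNER_DEBUG_TOKEN` variable refer to the control API proposed in the requirements, which doesn't exist yet, and the drain helper script is hypothetical:

```yaml
# Hypothetical deployment stages. Endpoints, tokens, variables and
# helper scripts are assumptions, not existing interfaces.
stages: [build, plan, apply, rotate, cleanup]

terraform-plan:
  stage: plan
  script:
    - terraform init
    - terraform plan -out=tfplan

terraform-apply:
  stage: apply
  script:
    - terraform apply tfplan

drain-old-managers:
  stage: rotate
  # Ask the old managers to shut down gracefully via the (proposed)
  # token-guarded control API, then wait until no jobs are running -
  # but no longer than 2h.
  timeout: 2h
  script:
    - |
      for host in $OLD_MANAGERS; do   # list computed by earlier jobs
        curl -X POST -H "Authorization: Bearer $RUNNER_DEBUG_TOKEN" \
          "http://${host}:9252/shutdown"   # proposed API, not real yet
      done
    - ./scripts/wait-for-drain.sh          # hypothetical helper polling metrics
```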
On any failure we send a Slack notification, and we have alerting for failures. In the successful case we do the same after the pipeline is finished (but without alerting).
We should also think about a process for cleaning up old Manager VM images, so we don't waste space and get billed for it. After a few updates we will definitely not need to roll back to old images, and the configuration can always be restored from git.
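One simple option, sketched under the assumption that images are named `wsrm-manager-*` and that keeping the five most recent ones is enough (both arbitrary choices), is a scheduled CI job that prunes everything older:

```yaml
# Hypothetical scheduled job pruning old manager images. The name
# filter and the number of images kept are arbitrary assumptions.
cleanup-old-images:
  stage: cleanup
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - |
      gcloud compute images list \
        --filter="name~'^wsrm-manager-'" \
        --sort-by=~creationTimestamp \
        --format="value(name)" \
      | tail -n +6 \
      | xargs -r -I{} gcloud compute images delete {} --quiet
```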