[meta] Deploy CI Runners infrastructure on K8S (while still using Docker Machine for autoscaling)
NOTICE:
To remove any confusion, please notice that this issue describes a plan for RUNNER MANAGERS DEPLOYMENT. Currently we're deploying Runner on a constant number of manually created VMs that are managed by Chef. The plan is to deploy the managers inside of a K8S cluster, which would drastically speed up and improve the deployment process.
The issue DOESN'T PROPOSE to switch from the `docker+machine` executor to the `kubernetes` executor. We still haven't figured out how to securely use the K8S executor for a multi-tenant environment like Shared Runners on GitLab.com.
Here we will plan another big step for the CI infrastructure after moving to GCP (#3531 (closed)) - a Kubernetes-based deployment. Initially we planned to start working on this after we finish all tasks in #3531 (closed), but the fact that only one task is left (moving from self-managed cache servers to GCS support) and the complexity of the monitoring infrastructure for gprd and gstg suggest that now is a good moment to start thinking about this.
Background
To get familiar with the current architecture of the CI infrastructure, please read https://docs.google.com/document/d/1WYmN5oukY3DK2hPFLPkxwnuyfxES8nNPeDLMTN_KhVM (internal only). Notice: the document was created a long time ago and a few things have changed since then, but it can still be used to better understand what the graphs are showing. Here I will share only two graphs that should give the general picture.
Old graphs from the linked document
General architecture (notice: DigitalOcean is now disabled by default and is used as a backup environment)
Architecture of the infrastructure inside of a cloud provider (notice: in GCP we're not using Consul anymore - we've switched to Prometheus' native GCP service discovery)
(source file in graphml)
The first proposal of a K8S-based architecture was described in December 2017 in a Google document (internal only).
So let's first ask the big question: why do we want to change the current deployment method at all?
The current deployment procedure is highly user-unfriendly.
In short: to update a Runner we need to start a graceful shutdown so that user jobs will not be interrupted. Our configuration sets a 2-hour timeout for the shutdown process, but we should soon align it with the job timeout set for Runners (which for Shared Runners on GitLab.com means 3 hours). To keep CI responsive we also can't update all managers at once. This means that in the worst case - when each Runner has a long-running job started just before the shutdown was triggered - the deployment may take up to 12 hours (e.g. four sequential update batches, each waiting up to the 3-hour timeout)!
The process also has a SPOF: the deployer's machine. We're using Chef to manage configuration, and the update is started with one command executing a Chef update on all machines sequentially. If the deployer has a networking issue, the procedure is interrupted in the middle. There is also no clean and easy way to hand the deployment over to another person.
After discussing several solutions we've decided that using K8S will be the best one, because:
- K8S is becoming the standard cloud-native way to manage applications.
- With K8S we still have the whole configuration stored in one place and managed in a declarative way.
- Using K8S for deployment, we'll be able to re-create the environment in minutes instead of hours (which is what it would take currently if some of our nodes died - we've already been there with one of our cache servers).
- If we choose something like Helm, we can make deployment management fully remote and trackable from different machines (so we're removing the SPOF caused by the deployer's machine).
- Even better: configuration linting and deployment may be turned into CI/CD jobs and fully integrated with the mechanisms that we have in GitLab (see the sketch after this list).
- Managing a service with K8S is one of our directions - and we will handle things in exactly the same way as we're proposing to our users.
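
As an illustration of that last point, linting and deployment could become regular CI/CD jobs. Below is a minimal sketch only - the Helm client image, chart paths, and environment names are assumptions, not decisions made in this issue:

```yaml
# Hypothetical .gitlab-ci.yml sketch: chart linting and a manual,
# trackable Helm deployment. All names and paths are illustrative.
stages:
  - lint
  - deploy

lint:charts:
  stage: lint
  image: lachlanevenson/k8s-helm:v2.9.1   # illustrative Helm client image
  script:
    - helm lint charts/*

deploy:gitlab-com:
  stage: deploy
  image: lachlanevenson/k8s-helm:v2.9.1
  environment:
    name: gitlab-com
  when: manual
  script:
    # --install makes the job work both for the first deployment and for updates
    - helm upgrade --install runners-gitlab-com charts/runner-fleet --namespace gitlab-com --values values/gitlab-com.yaml
```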
General architecture proposal
- We will still use the `docker+machine` executor as the autoscaling method. Currently we don't plan to use K8S as the executor.
- We will use Helm as the deployment method and store the whole configuration in a set of Helm charts (see the sketch after this list).
- We will move all parts of the infrastructure (not only the managers, but also the monitoring tools) to K8S - this will remove one layer of configuration management (Chef).
- The K8S cluster will be created by Terraform, so re-creating it will be fast and easy (and we will be able to re-use the present configuration for - for example - IAP guarding some of the services).
- We will switch from self-hosted cache servers to GCS usage (one thing less to manage, most probably much better performance; already tracked with #4565 (closed)).
- The implementation should be created in a way that allows treating part of the infrastructure as a `staging` environment.
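
To make the Helm part of the proposal more concrete, here is a hypothetical values.yaml fragment for one runner group. It assumes the GCS-cache and Docker Machine support from the chart MRs in the tasks list below; every key name here is illustrative, not the chart's final API:

```yaml
# Hypothetical values.yaml fragment for one runner manager group.
# Assumes the GCS-cache and docker+machine chart MRs are merged;
# all keys and values are illustrative.
concurrent: 100                    # ~100 concurrent jobs per manager
checkInterval: 3
gitlabUrl: https://gitlab.com/
runners:
  executor: docker+machine
  cache:
    cacheType: gcs
    gcsBucketName: runners-cache   # illustrative bucket name
  machine:
    idleCount: 10                  # warm standby VMs
    machineDriver: google
    machineOptions:
      - "google-machine-type=n1-standard-1"
      - "google-zone=us-east1-c"
```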
So let's look at the diagram now and describe some details:
dashboards.gitlab.com and dashboards.gitlab.net are parts of GitLab's infrastructure, but not of the CI Runners infrastructure. They are present on the diagram only to show the integration between these services and the infrastructure that this issue is about.
Kubernetes cluster
The biggest part of the infrastructure diagram is the K8S cluster. We can see five namespaces there: four of them (`gitlab-com`, `staging-gitlab-com`, `ops-gitlab-net` and `dev-gitlab-org`) contain Runner fleets. The fifth one (`monitoring`) contains services that centralize and export monitoring for the whole infrastructure.
Runners fleet
Each of the fleets contains a number of runner managers and a Prometheus server.
As can be seen on the image, the `gitlab-com` and `staging-gitlab-com` namespaces are similar - they differ only in the capacity of handled jobs. It's designed like that to make the `staging-gitlab-com` namespace a `staging` sandbox for the CI Runners infrastructure. It will provide Runners for staging.gitlab.com (so we'll be able to test changes in GitLab CI on staging), but it will also give us a place to experiment with the Runner's configuration. This means that we should be able to change the configuration of the Runner fleets and monitoring in each of the namespaces separately.
Ideally, each runner manager provides a Runner service that should handle 100 concurrent jobs. If this number is exceeded, additional containers should be scheduled (up to defined limits). When the number of jobs goes down, containers should be scaled down as well. However, for the first iteration we can create a fixed number of containers for each runner manager group to replicate our current configuration. The suggested number of containers for each of the groups is defined on the image by `X` and `Y`.
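
As an illustration of the scaling behavior described above, a HorizontalPodAutoscaler driven by a custom metric could look roughly like the sketch below. It assumes the `gitlab_runner_jobs` metric is exposed through the K8S custom metrics API (e.g. via a Prometheus adapter); names, replica counts, and the threshold are illustrative:

```yaml
# Sketch only: HPA scaling a runner manager deployment on the number of
# concurrent jobs. Assumes gitlab_runner_jobs is available via the
# custom metrics API; all names and numbers are illustrative.
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: shared-runners-manager
  namespace: gitlab-com
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: shared-runners-manager
  minReplicas: 2                     # the fixed baseline ("X"/"Y" on the image)
  maxReplicas: 10                    # defined limit for the group
  metrics:
    - type: Pods
      pods:
        metricName: gitlab_runner_jobs
        targetAverageValue: "100"    # ~100 concurrent jobs per manager
```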
Runner managers - on creation - register with a predefined configuration against the selected GitLab instances. Each of the described groups (like `shared-runners-manager-X` or `chargs-runner-manager-Y`) uses a slightly different configuration. Each of the managers works with GCP and autoscaled machines in the same way as they do now (so Docker Machine schedules machines and then Runner uses the Docker API to schedule the jobs). Ideally we should work on gitlab-org/gitlab-ce#40693, which would remove the need to register a Runner - we could just re-use a predefined configuration and share the token between multiple containers for each of the groups.
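
If gitlab-org/gitlab-ce#40693 lands, the shared token could simply live in a per-group Secret mounted by every container of that group. A minimal sketch, with illustrative names only:

```yaml
# Hypothetical per-group Secret holding a predefined runner token shared
# by all containers of one manager group (depends on gitlab-org/gitlab-ce#40693).
apiVersion: v1
kind: Secret
metadata:
  name: shared-runners-manager-token
  namespace: gitlab-com
type: Opaque
stringData:
  runner-token: "REPLACE-ME"   # placeholder - never commit a real token
```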
What will be new in %11.3, and what we need to think about, is the Web Terminal feature. In short: the user - through a websocket and GitLab - connects to the Runner, which manages session terminals and proxies the connection to the autoscaled machine. Eventually this allows the user to execute commands directly in the job's environment. This means that each of the runner manager services needs to be exposed outside of the K8S cluster. The endpoint needs to be known at service creation time, since it needs to be added to the Runner's configuration before start. At this moment it seems that it will be best to use the K8S Ingress feature (with hostnames dynamically created per runner manager service container).
Access to the session terminal endpoints on the Ingress should not be public - it should be available only to GitLab instances (for GitLab.com and staging.gitlab.com this can be handled within the GCP internal network; for dev.gitlab.org we should figure out the best solution).
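
A rough sketch of such a per-manager session-server Ingress (the hostname, whitelist CIDR, and port are assumptions, not decisions):

```yaml
# Sketch only: one Ingress per runner manager container, with a
# dynamically generated hostname and restricted source addresses.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: shared-runners-manager-1-session
  namespace: gitlab-com
  annotations:
    # limit access to GitLab's internal address range (illustrative CIDR)
    nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8"
spec:
  rules:
    - host: srm-1.sessions.example.com   # generated per manager container
      http:
        paths:
          - path: /
            backend:
              serviceName: shared-runners-manager-1
              servicePort: 8093          # Runner session_server port (example)
```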
The Prometheus server in a Runner fleet namespace uses native K8S service discovery and GCP service discovery to find:
- runner manager services in the namespace,
- autoscaled machines created by runner managers from the namespace.
It then connects to the `node_exporter` on the autoscaled machines and directly to the Runner processes to scrape the exported metrics. Finally, it provides pre-aggregations that will be scraped via the `federation` feature by the main Prometheus server (from the `monitoring` namespace).
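
For illustration, the service discovery part of a fleet Prometheus configuration could look roughly like this (the project, zone, and job names are assumptions):

```yaml
# Illustrative fragment of the fleet Prometheus config: K8S pod discovery
# for runner managers plus GCE discovery for autoscaled machines.
scrape_configs:
  - job_name: runner-managers
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - gitlab-com
  - job_name: autoscaled-machines
    gce_sd_configs:
      - project: gitlab-ci     # illustrative GCP project name
        zone: us-east1-c
        port: 9100             # node_exporter on the autoscaled machines
```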
Monitoring
This namespace contains two services - a Prometheus server and a GCP exporter.
The GCP exporter, using the GCP API, tracks available statistics about GCP usage and exports them in the form of Prometheus metrics.
The Prometheus server is the main monitoring server for the CI Runners infrastructure. It scrapes the GCP exporter directly, and it scrapes pre-aggregated metrics from the Runner fleet namespaces via the `federation` feature. This server is exposed via K8S Ingress and is set as a data source for the dashboards.gitlab.net and dashboards.gitlab.com Grafana instances. The Prometheus server should be exposed publicly via IAP (so that GitLab team members can access it directly). For the Grafana instances, the traffic should go internally via the GCP network.
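
And the corresponding federation job on the main Prometheus server could look roughly like this (the `match[]` selector and the fleet server address are assumptions):

```yaml
# Illustrative federation job on the main Prometheus server, scraping
# pre-aggregated series from one fleet Prometheus.
scrape_configs:
  - job_name: federate-gitlab-com
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"fleet:.*"}'   # pre-aggregated recording rules (example)
    static_configs:
      - targets:
          - prometheus.gitlab-com.svc:9090
```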
Autoscaled machines
At the very bottom we have the autoscaled machines created as Google Compute Engine VMs. These machines use Google Cloud Storage as the distributed cache server. They also connect to GitLab instances to fetch project sources and download/upload artifacts (although in the case of our environments they will be redirected to the Object Storage server where these are stored).
Questions
- How to gather Runner's logs? Stackdriver? Will this work for Runners executed as K8S containers? gitlab-org/gitlab-runner#3336 (closed) should additionally improve logs handling.
- How to access session servers on Runners? Ingress with dynamically created hostnames seems to be a good idea.
- What should be the size of the K8S cluster? How many nodes? Should we have a static number of managers?
  (answer from the linked Google doc): start with 3 nodes, auto-scale to as many as needed, make each manager run up to 100 jobs, and scale up when going above that.
- Do we want to distribute the K8S cluster in GCP between regions? Do we want to have two clusters in GCP - one per region?
  (answer from the linked Google doc):
  - distribute between availability zones,
  - have two clusters in two different regions: the region local to GitLab processing most of the jobs, and the remote region taking over and auto-scaling if we go over some threshold of pipeline queueing time.
- Do we have any GitLab or GitLab Runner features that would help us but are not implemented?
  (answer from the linked Google doc): gitlab-org/gitlab-ce#40693
- How to connect the internal networks of the `gitlab-ci` project (the one where the CI Runners infrastructure is hosted) and the project where the GPRD and GSTG environments are hosted?
  - VPC peering
Tasks list
- [ ] pre-start:
  - [ ] Decide which questions from the last section need to be answered before we start working on the implementation.
  - [ ] Decide when we're starting (should we first restore monitoring in one place and switch from minio to GCS, or should we make this part of this issue?)
- [ ] Preparation on the current infrastructure:
  - [ ] disable `syslog` logging (remove duplicated lines from logs)
  - [ ] switch to JSON logging to better integrate with log.gitlab.net
- [ ] https://gitlab.com/charts/gitlab-runner helm chart updates:
  - [ ] make it possible to configure GCS as the cache target (https://gitlab.com/charts/gitlab-runner!49)
  - [ ] make it possible to configure the Docker Machine executor (https://gitlab.com/charts/gitlab-runner/merge_requests/109)
  - [ ] Horizontal Pod Autoscaler for scaling the deployment (https://gitlab.com/charts/gitlab-runner/merge_requests/127)
  - [ ] Graceful shutdown support in the Helm chart (https://gitlab.com/charts/gitlab-runner/merge_requests/127)
  - [ ] Make the `gitlab_runner_jobs` metric always present (gitlab-org/gitlab-runner!1506 (closed))
  - [ ] Volumes configuration support in the Helm chart (gitlab-org/charts/gitlab-runner#83 (closed))
  - [ ] ...
- [ ] Decide on the way to deploy the configuration (we'll have a few environments with multiple Runner deployments, each with its own configuration). https://gitlab.com/tmaczukin-test-projects/runner-helm-stack contains an example of how HPA could be used.
- [ ] PoC of the cloud native infrastructure: the `monitoring` namespace + the `gitlab-com` namespace + the `staging-gitlab-com` namespace, with one Runner in each of the Runner namespaces. Deployed even manually - this one will be used only to discover how the planned configuration works. After this we should work on automation (of course after deciding on the configuration deployment method).
- [ ] Decide about the cleanup of the Runner tags that we're using for tagging Shared Runners: #5420 (closed)
- [ ] ...
- [ ] Document the architecture (see #4815 (closed) for context and information on where to add this)