Dogfooding the Kubernetes executor

A lot of customers use the Kubernetes executor, but the GitLab engineers working on it do not.

This leads to a few key points I would like to make:

  1. While some of us have expertise around Kubernetes, it's too large a topic to stay proficient in without regular exposure. This leads to us using inefficient options and practices.
  2. Because we don't use the Kubernetes executor ourselves, we end up not catching bugs such as 16.3.0 Kubernetes runner pods not cleaned up (#36803 - closed). While we do have integration tests, there's only so much they can cover without real-world production loads.
  3. Without direct experience creating and maintaining a production-grade Kubernetes cluster, we can't provide good solutions to problems customers regularly face, such as a complete monitoring solution and guidelines, or a proper cluster setup, among other things.

Apart from these, there are also common issues we deal with ourselves:

  1. Improve GitLab Runner integration tests for the... (#36827 - closed) - Integration tests have been flaky lately. The k3s-inside-CI setup worked for a while, but we've been hitting its limits.
  2. The Kubernetes executor is currently not tested in a Windows environment.

I propose we create and manage our own Kubernetes cluster, which should get us closer to covering all the points above.

The initial plan could be:

  1. Start with a GKE cluster. @ratchade has already provided instructions on how to do that, with Windows support as a bonus - https://gitlab.com/gitlab-org/ci-cd/runner-tools/gke-test-cluster-setup/-/blob/main/README.md?ref_type=heads. I would like us to avoid Autopilot, as it abstracts far too much away from us. If it makes sense at some point, we can change that.
  2. The cluster should be created through reproducible scripting of some kind.
  3. Deployment of the Runner should happen frequently. We should discuss the cadence, but deploying every successful pipeline from main sounds like a good idea. This will let us test the bleeding edge more.
  4. Implement monitoring for the cluster, since our CI will depend on it: Prometheus with Grafana. We already use Grafana for our shared and private runner metrics. There's also the option of using the cluster for all sorts of end-to-end tests, such as the Runner Fleeting / Taskscaler / GRIT Test Plan (#36787). Combined with our integration tests, this could leave the cluster cluttered enough to bring it to a halt. The monitoring should catch this: stale resources, cluster capacity, and so on.
  5. Start by running only the Runner integration tests in this cluster.
  6. Move side projects to this cluster. E.g. release pipelines, chart CI, UBI images etc.
  7. At this point we should either have already considered, or be considering, a production-grade Runner setup with high availability across multiple zones and nodes, and automatic zero-downtime rolling upgrades.
  8. Move all Runner CI jobs to using our Kubernetes cluster.
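To make the Runner deployment itself reproducible (steps 2-3), its configuration could live in the repo as a Helm values file for the official `gitlab/gitlab-runner` chart. A minimal sketch, where the namespace templating, concurrency, and job image are placeholder choices and the runner token would be injected from CI variables rather than committed:

```yaml
# values.yaml - sketch of a GitLab Runner chart config for the dogfooding cluster.
# Values here are illustrative starting points, not a vetted production setup.
gitlabUrl: https://gitlab.com/
concurrent: 10
rbac:
  create: true
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "{{.Release.Namespace}}"
        image = "alpine:latest"
```

Keeping this file in version control means every change to the cluster's Runner config goes through review, which also helps with point 2 (reproducibility).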
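Deploying every successful main pipeline (step 3) could be a single CI job gated on the branch. A hedged sketch, assuming cluster credentials are available to the job (e.g. via a CI/CD variable or a GitLab agent connection) and a `values.yaml` is checked into the repo - the job and file names are hypothetical:

```yaml
# .gitlab-ci.yml fragment - deploy the Runner chart on every green main pipeline.
deploy-runner:
  stage: deploy
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: on_success
  script:
    - helm repo add gitlab https://charts.gitlab.io
    - helm upgrade --install gitlab-runner gitlab/gitlab-runner
      --namespace gitlab-runner --create-namespace
      -f values.yaml
```

`helm upgrade --install` is idempotent, so repeated main pipelines converge on the latest config rather than failing on an existing release.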
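For step 4, the "stale resources" concern could start as a single Prometheus alerting rule over kube-state-metrics data. A sketch with an arbitrary starting threshold - the rule name, threshold, and duration are assumptions to be tuned:

```yaml
# prometheus-rules.yaml fragment - flag job pods that were never cleaned up.
groups:
  - name: runner-cluster-hygiene
    rules:
      - alert: StaleJobPods
        # kube_pod_status_phase (kube-state-metrics) is 1 for a pod's current phase;
        # completed or failed pods piling up suggests cleanup is broken.
        expr: count(kube_pod_status_phase{phase=~"Succeeded|Failed"} == 1) > 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Job pods are accumulating - pod cleanup may be broken"
```

An alert like this would have surfaced the pod-cleanup regression in #36803 from our own cluster before customers hit it.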
Edited by Georgi N. Georgiev