Migrate Kubernetes integration tests to use the Runner Kubernetes Cluster (!5175) · Merge requests · GitLab.org / gitlab-runner

What does this MR do?

⚠️ This MR is part of a series:

Migrate Kubernetes integration tests to use the... (!5175 - merged) • Georgi N. Georgiev | GitLab • 17.7 (You are here)
Improve Operator installation and add GKE features (gitlab-org/ci-cd/runner-tools/grit!124 - merged) • Georgi N. Georgiev | GitLab • 17.10
https://gitlab.com/gitlab-org/ci-cd/runner-tools/runner-kubernetes-infra/-/merge_requests/1+s
Remove test dirs and flatten operator modules (gitlab-org/ci-cd/runner-tools/grit!146 - merged) • Georgi N. Georgiev | GitLab • 17.10
Allow deploying multiple runners in a single k8... (gitlab-org/ci-cd/runner-tools/grit!147 - merged) • Georgi N. Georgiev | GitLab • 18.1
https://gitlab.com/gitlab-org/ci-cd/runner-tools/runner-kubernetes-infra/-/merge_requests/2+s
Run CI jobs in kubernetes (gitlab-org/charts/gitlab-runner!504 - merged) • Georgi N. Georgiev | GitLab • 17.9

⚠️

Migrates the existing Kubernetes integration tests to run in the Kubernetes cluster that was already set up through https://gitlab.com/gitlab-org/ci-cd/runner-tools/runner-kubernetes-infra/-/merge_requests/1/diffs rather than in a k3s instance running inside a VM.

Some notable changes in this MR:

Tests are split into 3 separate jobs: legacy strategy, attach and all non-ff tests
All tests can now be run in parallel and are marked as such
A resource group is set on the integration tests to prevent running too many pods at once on the cluster, as the concurrency is quite high to prioritize good job timings
The jobs now run for about 5 minutes, down from 30
I've fixed as many flaky tests as I could - at least the ones that were simple enough
I've skipped the other tests; I'll create an issue to fix them as a follow-up, as many seem implausible
We also now generate all permissions required to run a manager in Kubernetes as a YAML file
This YAML file is used to provision RBAC permissions in the cluster. If the objects are not deleted afterwards (for example, destroy doesn't run for some reason), the cluster will clean them up, so there is no need to worry about that
All objects that are created in the tests namespace will be cleaned up at some point if not deleted; we should still try to clean up after ourselves though
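
To illustrate the trade-off between high test concurrency and a cap on simultaneous work, here is a minimal Go sketch. It is only an analogy: it is not how GitLab's resource groups are implemented and it is not code from this MR. The names `runLimited`, `limit` and `tasks` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// runLimited starts every task in its own goroutine but lets at most
// limit of them execute at the same time. It returns the highest number
// of tasks that were observed running simultaneously.
// limit must be greater than zero, otherwise no task can ever start.
func runLimited(limit int, tasks []func()) int {
	sem := make(chan struct{}, limit) // buffered channel used as a counting semaphore
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		running int
		peak    int
	)
	for _, task := range tasks {
		wg.Add(1)
		go func(task func()) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot, blocks while limit tasks are running
			defer func() { <-sem }() // release the slot when the task is done

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			task()

			mu.Lock()
			running--
			mu.Unlock()
		}(task)
	}
	wg.Wait()
	return peak
}

func main() {
	// Ten hypothetical no-op "tests", with at most three running at once.
	tasks := make([]func(), 10)
	for i := range tasks {
		tasks[i] = func() {}
	}
	fmt.Println("peak concurrency:", runLimited(3, tasks))
}
```

The cap trades some throughput for a bounded load on the shared cluster, which is the same reasoning given above for the resource group.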

Update: tests are down to 3 minutes per job now.

I switched from splitic to gotestsum. The biggest purpose of splitic is to split the tests into multiple jobs, which isn't applicable here. gotestsum has the following advantages:

Startup time is faster - that's where the 2 saved minutes come from
It supports retrying failed tests. We now retry tests up to 3 times, which should reduce flakiness
We generate junit reports now. Splitic doesn't support putting the failed test's output into the report as far as I could see, gotestsum does that so it's easier to find out why a test failed
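
The retry behaviour described above can be sketched in Go as follows. This is an illustration of the general idea, not gotestsum's implementation. The helpers `runWithRetries` and `flakyTest` are hypothetical, and the sketch assumes that "up to 3 times" means up to 3 reruns after the first failed run; the exact counting in gotestsum may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// runWithRetries runs fn once and, if it fails, reruns it up to
// maxRetries more times. It returns the number of runs performed and
// the error from the last run (nil if any run succeeded).
func runWithRetries(maxRetries int, fn func() error) (int, error) {
	var err error
	runs := 0
	for runs <= maxRetries {
		runs++
		if err = fn(); err == nil {
			return runs, nil
		}
	}
	return runs, err
}

// flakyTest returns a hypothetical test function that fails
// failuresBeforePass times and then passes on every later call.
func flakyTest(failuresBeforePass int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failuresBeforePass {
			return errors.New("transient failure")
		}
		return nil
	}
}

func main() {
	runs, err := runWithRetries(3, flakyTest(2))
	fmt.Printf("runs=%d err=%v\n", runs, err)
}
```

A test that fails intermittently passes after a rerun, while a test that is genuinely broken still fails once the retries are exhausted.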

Why was this MR needed?

To make the integration tests more reliable

What's the best way to test this MR?

Integration tests should run

Running the tests against a local cluster should also work

What are the relevant issue numbers?

Closes Dogfooding the Kubernetes executor - Step 2 - R... (#38305 - closed) • Georgi N. Georgiev | GitLab • 17.7, Run kubernetes integration tests with a service... (#38306 - closed) • Georgi N. Georgiev | GitLab • 17.7

Edited Jan 07, 2025 by Georgi N. Georgiev | GitLab
