Make Kubernetes executor fault tolerant (!5005) · Merge requests · GitLab.org / gitlab-runner

⚠ This MR's branch is based on ggeorgiev/fault-tolerance-file-store ⚠

⚠ This is MR #3 in the chain ⚠

Add fault tolerance to build processing machinery (!5003 - closed) • Georgi N. Georgiev | GitLab • 18.5
Add file store to fault tolerance (!5004 - closed) • Georgi N. Georgiev | GitLab • 18.5
Make Kubernetes executor fault tolerant (!5005 - closed) • Georgi N. Georgiev | GitLab • 18.5 (You are here)

What does this MR do?

This is the final MR in the chain. With it, we should be able to run jobs on the Kubernetes executor in a fault-tolerant fashion.

Why was this MR needed?

What's the best way to test this MR?

In the config of your Kubernetes runner, configure the file store:

[runners.store]
name = "file"

[runners.store.file]
path = "/tmp/store"

Start a job that runs long enough to be interrupted, for example one that counts to 60:

fault-tolerance-counter:
  image: ubuntu
  script:
    - counter=1; while [ $counter -le 60 ]; do echo $counter; ((counter++)); sleep 1; done
  tags:
    - k8s-local

At some point while the job is running, kill the manager with SIGKILL, then restart it. By default, the job can be resumed once 30 seconds have passed.

It should resume normally.
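The kill and restart steps above can be sketched as a few shell helpers. This is only a rough sketch under assumptions that are not part of this MR: the manager is assumed to run as `gitlab-runner run`, the config is assumed to live at `$HOME/.gitlab-runner/config.toml`, and the helper names (`kill_manager`, `show_store`, `restart_manager`) are made up for illustration. The layout of the files written to the store is not described here, so `show_store` only lists the directory.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for testing; names and paths are assumptions, not part of the MR.

CONFIG="${CONFIG:-$HOME/.gitlab-runner/config.toml}"  # assumed config location
STORE_DIR="${STORE_DIR:-/tmp/store}"                  # path from [runners.store.file]

# Send SIGKILL to the manager so it gets no chance to shut down gracefully.
# Assumes the manager's command line contains "gitlab-runner run".
kill_manager() {
  pkill -KILL -f "gitlab-runner run" || echo "no running manager found" >&2
}

# List whatever the file store has persisted so far.
show_store() {
  ls -la "$STORE_DIR" 2>/dev/null || echo "store directory $STORE_DIR not found" >&2
}

# Start the manager again in the foreground with the same config.
restart_manager() {
  gitlab-runner run --config "$CONFIG"
}
```

A possible sequence while the counter job is running: `kill_manager`, then `show_store` to confirm something was persisted, then `restart_manager`, and watch the job log to see whether the count continues.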

What are the relevant issue numbers?

Edited Sep 16, 2024 by Georgi N. Georgiev | GitLab
