
Make Kubernetes executor fault tolerant

This MR's branch is based on ggeorgiev/fault-tolerance-file-store

This is MR #3 in the chain

  1. Add fault tolerance to build processing machinery (!5003 - closed) • Georgi N. Georgiev | GitLab • 18.5
  2. Add file store to fault tolerance (!5004 - closed) • Georgi N. Georgiev | GitLab • 18.5
  3. Make Kubernetes executor fault tolerant (!5005 - closed) • Georgi N. Georgiev | GitLab • 18.5 (You are here)

What does this MR do?

This is the final MR in the chain. With this MR, we should be able to run jobs in a fault-tolerant fashion on Kubernetes.

Why was this MR needed?

What's the best way to test this MR?

In your Kubernetes runner's config, configure the store:

[runners.store]
name = "file"

[runners.store.file]
path = "/tmp/store"

Start a job that counts to 60, for example:

fault-tolerance-counter:
  image: ubuntu
  script:
    - counter=1; while [ $counter -le 60 ]; do echo $counter; ((counter++)); sleep 1; done
  tags:
    - k8s-local

At some point while the job is running, kill the manager with SIGKILL, then restart it. By default, the job can be resumed once 30 seconds have passed.

It should resume normally.
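
The kill step can be sketched as a small shell helper. This is only an illustration for local testing: `kill_manager` is a hypothetical name, not part of GitLab Runner, and it assumes the manager runs as a local process whose PID you already know. If your manager runs inside a pod, the equivalent step will look different.

```shell
# Sketch only: kill_manager is a hypothetical helper, not a GitLab Runner command.
# It sends SIGKILL, which a process cannot trap, so no graceful shutdown runs.
kill_manager() {
  pid="$1"
  # Refuse to act on an empty or non-numeric PID.
  case "$pid" in
    ''|*[!0-9]*) echo "usage: kill_manager <pid>" >&2; return 2 ;;
  esac
  kill -KILL "$pid"
}
```

For example, if the manager runs locally under the process name `gitlab-runner` (an assumption about your setup), `kill_manager "$(pgrep -o gitlab-runner)"` would target the oldest matching process. After that, restart the manager as you normally would and wait for the job to resume.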

What are the relevant issue numbers?

Edited by Georgi N. Georgiev | GitLab
