Make Kubernetes executor fault tolerant
ggeorgiev/fault-tolerance-file-store
#3 in the chain
- Add fault tolerance to build processing machinery (!5003 - closed) • Georgi N. Georgiev | GitLab • 18.5
- Add file store to fault tolerance (!5004 - closed) • Georgi N. Georgiev | GitLab • 18.5
- Make Kubernetes executor fault tolerant (!5005 - closed) • Georgi N. Georgiev | GitLab • 18.5 (You are here)
What does this MR do?
This is the final MR in the chain. With it, we should be able to run jobs in a fault-tolerant fashion on the Kubernetes executor.
Why was this MR needed?
What's the best way to test this MR?
In your Kubernetes Runner's config, configure the store:
[runners.store]
name = "file"
[runners.store.file]
path = "/tmp/store"
Start a job that counts to 60, for example:
fault-tolerance-counter:
  image: ubuntu
  script:
    - counter=1; while [ $counter -le 60 ]; do echo $counter; ((counter++)); sleep 1; done
  tags:
    - k8s-local
At some point while the job is running, kill the manager with SIGKILL, then restart it. By default, the job can be resumed once 30 seconds have passed.
It should resume normally.
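The kill step above can be sketched as a small shell helper. This is only an illustration: `kill_manager` is a hypothetical name, not part of GitLab Runner, and it assumes the manager runs as a local process whose PID you already know.

```shell
#!/usr/bin/env bash
# Sketch only: kill_manager is a hypothetical helper, not part of GitLab Runner.
# It sends SIGKILL, so the process gets no chance to shut down gracefully.
kill_manager() {
  local pid="$1"
  if [ -z "$pid" ]; then
    echo "usage: kill_manager <pid>" >&2
    return 2
  fi
  kill -KILL "$pid"
}
```

Assuming the manager was started locally with `gitlab-runner run`, something like `kill_manager "$(pgrep -o -f 'gitlab-runner run')"` would find and kill it. After that, start the manager again the way you normally do and give it roughly 30 seconds before checking the job log.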
What are the relevant issue numbers?
Edited by Georgi N. Georgiev | GitLab