Hanging jobs when cutting 18.0.1, Fault Tolerance feature
When preparing for 18.0.1, when merging the relevant MR into main, we've seen a lot of hanging jobs, namely jobs running on our k8s cluster. https://gitlab.com/gitlab-org/gitlab-runner/-/pipelines/1821474628
We have observed, that the pods on the cluster where still running, but gitlab did not receive any updates anymore, and eventually they timed out.
It turned out, that times when the jobs appeared to be hanging correlated with the times the manager-killer ran.
Therefore we suspended this cronjob manually, which resolved the immediate issue:
kubectl patch cronjob -n gitlab-runner-system manager-killer -p '{"spec" : {"suspend" : true}}' -o yaml
There is the assumption, that releasing 18.0.0 deployed a runner-managers without the fault tolerance feature, and this then showed when the manager-killer kicked in.
We should:
- validate this assumption
- ensure that we don't accidentally deploy "non-fault-tolerant" runner-managers as part of the release process.