Kubernetes runner liveness check improvement
Description
Based on a GitLab support case, I've been asked by @duncan_harris to submit this as a feature request here. @duncan_harris, please add any information you think is missing.
Summary:
The `gitlab-runner verify` command is network dependent and has an internal timeout of just over 60 seconds, so the current liveness probe settings do not match it; the mismatch can cause unexpected runner restarts with no detail about why the liveness probe failed. Also, Kubernetes liveness probes are intended for situations where a container is dead and not recoverable unless it is killed and restarted; I'm not sure they are being used appropriately here.
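For anyone who wants to reproduce the timing, something like the following from a shell inside the runner pod should show the mismatch (a sketch only; the exact container image and paths may differ):

```shell
# Time the command the liveness check depends on. With the GitLab API
# unreachable it only returns after its internal timeout (~60s, ~62s total).
time gitlab-runner verify

# Approximate what the probe does with the current 3-second default:
# the check is killed long before verify can print a failure reason.
timeout 3 gitlab-runner verify; echo "exit code: $?"   # 124 = killed by timeout
```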
Details:
- In the first block of the liveness script, where it runs `pgrep`, the script exits with status 1 without echoing a failure reason that would reach the Kubernetes event log (see the script sketch after this list).
- For the `gitlab-runner verify` command, if the probe times out before the script does, no useful output is produced in the event log for this failure case either.
- I timed `gitlab-runner verify`: it appears to have an internal timeout of 60 seconds, and the whole invocation takes about 62 seconds to run, time out, and finally report a reason/error. Based on this, perhaps the Helm chart should set the default `probeTimeoutSeconds` to 70 (the current default is 3 seconds); see the workaround sketch after this list. Or perhaps `gitlab-runner verify` has an adjustable timeout that could be shortened to finish before the liveness probe's timeout, so a proper error lands in the event log.
- The liveness probe command for `gitlab-runner verify` hides the failure reason by piping it through `2>&1 | grep`; the script sketch below keeps that output instead.
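To illustrate the points above, here is a rough sketch of a probe script that surfaces the failure reason. This is a hypothetical rewrite, not the chart's current check script, and the 10-second bound on verify is an arbitrary assumption chosen to stay under a typical probe timeout:

```shell
#!/usr/bin/env bash
# Hypothetical health-check sketch; not the script currently shipped in the chart.

# Process check: print *why* we are failing so the reason reaches the
# Kubernetes event log instead of a bare 'exit 1'.
if ! pgrep "gitlab-runner" > /dev/null; then
  echo "liveness check failed: no gitlab-runner process found"
  exit 1
fi

# Bound verify below the probe's timeoutSeconds (10s is an assumed value)
# and keep its output rather than discarding it through '2>&1 | grep'.
if ! output=$(timeout 10 gitlab-runner verify 2>&1); then
  echo "gitlab-runner verify failed: ${output}"
  exit 1
fi
```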
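As an interim workaround, and assuming the chart's `probeTimeoutSeconds` value maps straight onto the probe's `timeoutSeconds` field and the chart repo is added as `gitlab`, the timeout can be raised per deployment rather than waiting for a new default:

```shell
# Sketch: give the probe more time than verify's ~62s worst case.
helm upgrade --install gitlab-runner gitlab/gitlab-runner \
  --set probeTimeoutSeconds=70
```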
Proposal
Evaluate the intent of the liveness probe against Kubernetes guidelines and best practices. In my opinion, network connectivity issues and failures of calls to the remote GitLab API should potentially cause a readiness probe failure, but they do not warrant the liveness probe killing the runner pod unless the runner pod is unable to recover from a lost network connection on its own. A rough sketch of that split follows.
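This is only an illustration of the separation using standard probe semantics, not the chart's current scripts: the liveness side checks nothing but whether the runner process is alive, while the network-dependent verify call backs a readiness-style check that marks the pod unready instead of restarting it.

```shell
# check-live (liveness): only answers "is the runner process dead?"
pgrep "gitlab-runner" > /dev/null \
  || { echo "runner process not running"; exit 1; }

# check-ready (readiness, a separate script): answers "can we reach the
# GitLab API right now?"; failing here should mark the pod unready,
# not restart it. 10s is an assumed bound below the probe timeout.
timeout 10 gitlab-runner verify \
  || { echo "unable to verify runner against the GitLab API"; exit 1; }
```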