Liveness and Readiness Probe Failures for Gitlab-Runner on Kubernetes v1.20+
Hi folks! First, I want to give a rundown of the environment where we faced this issue.
Environment:

- Helm Chart version: 0.27.0
- Kubernetes version: separate clusters running versions 1.20 and 1.17
- Running on AWS EKS
Problem Description:
We run our gitlab-runners on AWS EKS using the Kubernetes executor. Recently we noticed that on the cluster running Kubernetes 1.20 we encounter intermittent failures of the liveness and readiness probes, showing up like this in the events of our gitlab-runner pod:
```
Warning  Unhealthy  8s  kubelet  Liveness probe failed:
Warning  Unhealthy  8s  kubelet  Readiness probe failed:
```
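For context, the chart renders the runner Deployment's probes roughly like this (a sketch from memory of chart 0.27.0; the script path and the delay/period values are assumptions, but the `timeoutSeconds: 1` is the hardcoded value this issue is about):

```yaml
# Sketch of the probe section of the gitlab-runner Deployment.
# Paths and delay/period values are assumptions; timeoutSeconds: 1
# is the hardcoded setting discussed below.
livenessProbe:
  exec:
    command: ["/bin/bash", "/scripts/check-live"]
  initialDelaySeconds: 60
  timeoutSeconds: 1
  periodSeconds: 10
readinessProbe:
  exec:
    command: ["/usr/bin/pgrep", "gitlab.*runner"]
  initialDelaySeconds: 10
  timeoutSeconds: 1
  periodSeconds: 10
```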
To debug the issue, we manually modified our `gitlab-runner-gitlab-runner` configmap to print some information about running processes whenever our `livenessProbe` failed, since we suspected the runner process was being killed. Our debug changes looked like the following:
```yaml
check-live: |
  #!/bin/bash
  # Healthy if the runner registration helper is still running...
  if /usr/bin/pgrep -f .*register-the-runner; then
    exit 0
  # ...or if the gitlab-runner process itself is.
  elif /usr/bin/pgrep gitlab.*runner; then
    exit 0
  else
    # Debug addition: dump the process table so we can see what died.
    echo "Failing pgrep $(ps aux)"
    exit 1
  fi
```
We confirmed the validity of this method by temporarily modifying the configmap to `pgrep` for `gitlab.*runner123`, a pattern that matches nothing. This returned an event like the following:
```
Warning  Unhealthy  2s  kubelet  Liveness probe failed: Failing pgrep [ps aux output]
```
However, with the debug change in place, the next time we saw a genuine liveness probe failure the output was:
```
Warning  Unhealthy  8s  kubelet  Liveness probe failed:
```
The `Failing pgrep $(ps aux)` output was missing. This confirmed a different suspicion: the issue was not that the process was being killed, but that the probe `exec` commands were failing to execute at all.
Looking through the Kubernetes documentation, we discovered this note on exec probes:

> Before Kubernetes 1.20, the field `timeoutSeconds` was not respected for exec probes: probes continued running indefinitely, even past their configured deadline, until a result was returned.
In the cluster where we observed the issue, we are running Kubernetes v1.20. There, the `1s` probe timeouts are now enforced and are occasionally not met, resulting in failures with no output.
In other clusters, where we have long had gitlab-runners deployed, we had never seen this issue. Those non-problematic clusters run Kubernetes v1.17. There, as the documentation states, the probe timeouts are not enforced, so a probe that takes longer than the listed `1s` timeout is still allowed to return successfully, resulting in no failures.
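If it helps anyone reproduce the enforcement change in isolation, here is a hypothetical standalone pod (not part of the runner chart) whose exec probe deliberately sleeps past its `timeoutSeconds`. On v1.20+ it produces the same empty-output `Liveness probe failed:` events; on v1.17 it runs clean:

```yaml
# Hypothetical repro: the probe command takes ~2s but timeoutSeconds is 1.
# On Kubernetes v1.20+ the kubelet kills the exec before it prints anything,
# yielding "Liveness probe failed:" with empty output; on v1.17 the timeout
# is not enforced and the probe passes.
apiVersion: v1
kind: Pod
metadata:
  name: exec-probe-timeout-repro
spec:
  containers:
    - name: sleeper
      image: busybox
      command: ["sleep", "3600"]
      livenessProbe:
        exec:
          command: ["/bin/sh", "-c", "sleep 2 && echo ok"]
        timeoutSeconds: 1
        periodSeconds: 5
```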
To confirm our results, we deployed a new runner on a different cluster that was also running v1.20, and after several minutes we once again saw probe failures.
Resolution:
Given the above, we need a way to increase our probes' `timeoutSeconds` beyond `1s` so that we no longer see these failures.
The Kubernetes documentation suggests the following:
> As a cluster administrator, you can disable the feature gate `ExecProbeTimeout` (set it to `false`) on each kubelet to restore the behavior from older versions, then remove that override once all the exec probes in the cluster have a `timeoutSeconds` value set.
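For reference, disabling the gate would mean setting it in each node's kubelet configuration, along these lines (a sketch; on EKS this would have to be wired through the node group's kubelet bootstrap arguments):

```yaml
# KubeletConfiguration snippet restoring the pre-1.20 exec probe behavior.
# Must be applied per node/node group, not cluster-wide.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ExecProbeTimeout: false
```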
Using the `ExecProbeTimeout` feature gate does not seem intended as a long-term solution. And since we deploy our runner via these Helm Charts, we are constrained to what the charts expose; currently the `timeoutSeconds` for the liveness and readiness probes is not configurable in the Helm Chart. We could maintain and modify a forked version of the chart, but we would prefer to keep updating to the latest chart versions without managing those changes ourselves in a fork.
I have made an MR to make the probes' `timeoutSeconds` configurable, with a default of `1s`, the current hardcoded setting.
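With that change, raising the timeout would be a one-line values override, along these lines (the key name reflects my proposal and may change in review):

```yaml
# values.yaml override once the probe timeout is configurable.
# 1 remains the default, matching today's hardcoded behavior.
probeTimeoutSeconds: 3
```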
I hope you will consider this a worthwhile addition to the project, and thanks for your help!