Liveness and Readiness Probe Failures for Gitlab-Runner on Kubernetes v1.20+
Hi folks! First, I want to give a rundown of the environment where we faced this issue.
Environment:

- Helm Chart version: 0.27.0
- Kubernetes version: separate clusters running versions 1.20 and 1.17
- Running on AWS EKS
Problem Description:
We run our gitlab-runners on AWS EKS using the Kubernetes executor. Recently we noticed that on the cluster running Kubernetes 1.20 we encounter intermittent failures of the liveness and readiness probes, showing up like this in the events of our gitlab-runner pod:
```
Warning  Unhealthy  8s  kubelet  Liveness probe failed:
Warning  Unhealthy  8s  kubelet  Readiness probe failed:
```
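For context, the chart renders the runner Deployment's probes roughly like this (a sketch from memory of chart 0.27.0; the script path and the delay/period values are assumptions, but the `timeoutSeconds: 1` is the hardcoded value this issue is about):

```yaml
# Sketch of the probe section of the gitlab-runner Deployment.
# Paths and delay/period values are assumptions; timeoutSeconds: 1
# is the hardcoded setting discussed below.
livenessProbe:
  exec:
    command: ["/bin/bash", "/scripts/check-live"]
  initialDelaySeconds: 60
  timeoutSeconds: 1
  periodSeconds: 10
readinessProbe:
  exec:
    command: ["/usr/bin/pgrep", "gitlab.*runner"]
  initialDelaySeconds: 10
  timeoutSeconds: 1
  periodSeconds: 10
```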
To debug the issue, we manually modified our `gitlab-runner-gitlab-runner` configmap to print some information about running processes whenever our `livenessProbe` failed, since we suspected the runner process was being killed. Our debug changes looked like the following:
```yaml
check-live: |
  #!/bin/bash
  # Healthy if the runner registration helper is still running...
  if /usr/bin/pgrep -f .*register-the-runner; then
    exit 0
  # ...or if the gitlab-runner process itself is.
  elif /usr/bin/pgrep gitlab.*runner; then
    exit 0
  else
    # Debug addition: dump the process table so we can see what died.
    echo "Failing pgrep $(ps aux)"
    exit 1
  fi
```
We confirmed the validity of this method by temporarily modifying the configmap to `pgrep` for `gitlab.*runner123`, a pattern that matches nothing. This returned an event like the following:
```
Warning  Unhealthy  2s  kubelet  Liveness probe failed: Failing pgrep [ps aux output]
```
However, with the debug change in place, the next time we saw a genuine liveness probe failure the output was:
```
Warning  Unhealthy  8s  kubelet  Liveness probe failed:
```
The `Failing pgrep $(ps aux)` output was missing. This confirmed a different suspicion: the issue was not that the process was being killed, but that the probe `exec` commands were failing to execute at all.
Looking through the Kubernetes documentation, we discovered this note on exec probes:

> Before Kubernetes 1.20, the field `timeoutSeconds` was not respected for exec probes: probes continued running indefinitely, even past their configured deadline, until a result was returned.
In the cluster where we observed the issue, we are running Kubernetes v1.20. There, the `1s` probe timeouts are now enforced and are occasionally not met, resulting in failures with no output.
In other clusters, where we have long had gitlab-runners deployed, we had never seen this issue. Those non-problematic clusters run Kubernetes v1.17. There, as the documentation states, the probe timeouts are not enforced, so a probe that takes longer than the listed `1s` timeout is still allowed to return successfully, resulting in no failures.
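If it helps anyone reproduce the enforcement change in isolation, here is a hypothetical standalone pod (not part of the runner chart) whose exec probe deliberately sleeps past its `timeoutSeconds`. On v1.20+ it produces the same empty-output `Liveness probe failed:` events; on v1.17 it runs clean:

```yaml
# Hypothetical repro: the probe command takes ~2s but timeoutSeconds is 1.
# On Kubernetes v1.20+ the kubelet kills the exec before it prints anything,
# yielding "Liveness probe failed:" with empty output; on v1.17 the timeout
# is not enforced and the probe passes.
apiVersion: v1
kind: Pod
metadata:
  name: exec-probe-timeout-repro
spec:
  containers:
    - name: sleeper
      image: busybox
      command: ["sleep", "3600"]
      livenessProbe:
        exec:
          command: ["/bin/sh", "-c", "sleep 2 && echo ok"]
        timeoutSeconds: 1
        periodSeconds: 5
```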
To confirm our results, we deployed a new runner on a different cluster that was also running v1.20, and after several minutes we once again saw probe failures.
Resolution:
Given the above, we need a way to increase our probes' `timeoutSeconds` beyond `1s` so that we no longer see these failures.
The Kubernetes documentation suggests the following:
> As a cluster administrator, you can disable the feature gate `ExecProbeTimeout` (set it to `false`) on each kubelet to restore the behavior from older versions, then remove that override once all the exec probes in the cluster have a `timeoutSeconds` value set.
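For reference, disabling the gate would mean setting it in each node's kubelet configuration, along these lines (a sketch; on EKS this would have to be wired through the node group's kubelet bootstrap arguments):

```yaml
# KubeletConfiguration snippet restoring the pre-1.20 exec probe behavior.
# Must be applied per node/node group, not cluster-wide.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ExecProbeTimeout: false
```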
Using the `ExecProbeTimeout` feature gate does not seem intended as a long-term solution. And since we deploy our runner via these Helm Charts, we are constrained to what the charts expose; currently the `timeoutSeconds` for the liveness and readiness probes is not configurable in the Helm Chart. We could maintain and modify a forked version of the chart, but we would prefer to keep updating to the latest chart versions without managing those changes ourselves in a fork.
I have made an MR to make the probes' `timeoutSeconds` configurable, with a default of `1s`, the current hardcoded setting.
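With that change, raising the timeout would be a one-line values override, along these lines (the key name reflects my proposal and may change in review):

```yaml
# values.yaml override once the probe timeout is configurable.
# 1 remains the default, matching today's hardcoded behavior.
probeTimeoutSeconds: 3
```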
I hope you will consider this a worthwhile addition to the project, and thanks for your help!