
WIP: Add 'unhealthy' internal command to check runner health

Arran Walker requested to merge ajwalker/gitlab-runner:runner-healthcheck into master

What does this MR do?

The existing runner health check is modified to write out a health status file.

The status file exists only while at least one runner is unhealthy; its contents are a newline-delimited list of the unhealthy runners. Once all runners return to being healthy, the status file is removed.

A new subcommand, unhealthy, checks the health status file and exits with code 1 if any runner is unhealthy. The subcommand, rather than the mere existence of the status file, is the preferred way to check whether runners are healthy. This lets us change how runner health is determined in the future without breaking the mechanism for checking it.
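For illustration only, here is a minimal sketch of the kind of check the subcommand performs. The file path and the standalone main function are hypothetical placeholders, not the actual wiring inside the runner's commands package:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// statusFilePath is a hypothetical location for the health status file;
// the real path is determined by the runner itself.
const statusFilePath = "/var/lib/gitlab-runner/.runner-health"

func main() {
	data, err := os.ReadFile(statusFilePath)
	if os.IsNotExist(err) {
		// No status file: all runners are healthy.
		os.Exit(0)
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, "unable to read health status file:", err)
		os.Exit(1)
	}

	// The file contains a newline-delimited list of unhealthy runners.
	unhealthy := strings.Fields(strings.TrimSpace(string(data)))
	if len(unhealthy) > 0 {
		fmt.Fprintln(os.Stderr, "unhealthy runners:", strings.Join(unhealthy, ", "))
		os.Exit(1)
	}
	os.Exit(0)
}
```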

The Runner Dockerfiles are updated with HEALTHCHECKs to periodically call the unhealthy subcommand.
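As a rough sketch of what that wiring looks like (the interval, timeout, and retry values here are placeholders, not necessarily the ones used in the MR):

```dockerfile
# Hypothetical values; a non-zero exit from `gitlab-runner unhealthy`
# marks the container as unhealthy.
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD ["gitlab-runner", "unhealthy"]
```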

Why was this MR needed?

The discussion in #3928 proposed adding a new HTTP endpoint. However, it was suggested that endpoint listeners, and the configuration around them, are likely to change in the future.

This MR adds health check support with limited changes, by extending a health check that already exists, without having to add an endpoint right now. If we later add an endpoint, or additional health checks, the subcommand can call that endpoint (in the case of Docker, a health check always requires executing a command inside the container anyway).

Are there points in the code the reviewer needs to double check?

When a runner hits client errors several times in a row, it is temporarily disabled. The daemon doesn't exit, but an error is printed to the log. Eventually the runner attempts a client request again, and if it succeeds, it returns to being healthy.

Not all client errors were being treated as "unhealthy". This caused errors, such as those from contacting a non-existent server, to be printed to the log endlessly. This MR changes that so all client errors mark the runner as "unhealthy".

Unfortunately, with the existing implementation a runner could only return to being healthy after 1 hour. Now that all client errors are considered unhealthy, being disabled for an entire hour could cause a lot of problems, so this MR reduces the retry time to 5 minutes. For client errors that do actually reach a remote server, the client's own backoff mechanism should also kick in to keep the delays between requests sane.

Does this MR meet the acceptance criteria?

  • Documentation created/updated
  • Added tests for this feature/bug
  • In case of conflicts with master - branch was rebased

What are the relevant issue numbers?

#3928

Closes #3798

