Configurable interval for disabling an unhealthy runner

Description

When a GitLab instance is unavailable for an extended period (for example, during a version upgrade), the GitLab runners configured for that instance "sleep" and, based on customer observation, do not resume job processing until 30-60 minutes after the instance is available again. This causes a large backlog of pipelines and delays project deployments.

Proposal

Add a configuration parameter that specifies the maximum time a runner "sleeps" once it determines that the API is unavailable, so that runners poll the API more frequently and resume job processing sooner. Also allow a value that disables the "sleep" behavior altogether, so that runners behave as if the API had been continuously available and resume job processing as quickly as possible.
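To make the intended behavior concrete, here is a minimal Python sketch of the proposal. It is a toy model, not GitLab Runner's actual code: the names `HealthTracker`, `max_sleep_seconds`, `failure_limit`, and `resume_delay` are hypothetical, and the 3-failure limit and 3-second polling interval are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthTracker:
    """Toy model of a runner's API health check (hypothetical, for illustration)."""
    max_sleep_seconds: float            # proposed parameter (hypothetical name); 0 disables sleeping
    failure_limit: int = 3              # assumed consecutive failures before the runner sleeps
    failures: int = 0
    sleep_until: Optional[float] = None

    def record(self, ok: bool, now: float) -> None:
        """Record the result of one API request made at time `now`."""
        if ok:
            self.failures = 0
            self.sleep_until = None
            return
        self.failures += 1
        # Sleep only if sleeping is enabled and the failure limit was reached.
        if self.max_sleep_seconds > 0 and self.failures >= self.failure_limit:
            self.sleep_until = now + self.max_sleep_seconds

    def may_poll(self, now: float) -> bool:
        """Return True if the runner is allowed to contact the API at `now`."""
        return self.sleep_until is None or now >= self.sleep_until


def resume_delay(max_sleep_seconds: float, outage_seconds: float,
                 check_interval: float = 3.0) -> float:
    """Simulate an outage starting at t=0 and return how many seconds pass
    between the API coming back and the first successful poll."""
    tracker = HealthTracker(max_sleep_seconds)
    now = 0.0
    while True:
        if tracker.may_poll(now):
            ok = now >= outage_seconds   # the API answers only after the outage ends
            tracker.record(ok, now)
            if ok:
                return now - outage_seconds
        now += check_interval
```

In this model, a 10-minute outage with a 60-minute sleep (`resume_delay(3600, 600)`) leaves the runner idle for 3006 seconds after the API returns. Capping the sleep at 60 seconds reduces that to 6 seconds, and a value of 0 (sleep disabled) resumes at the first poll after recovery.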

Links to related issues and merge requests / references

The check_interval parameter is documented as configuring how frequently a runner checks for new jobs. However, that documentation does not make clear whether the parameter also affects the longer observed interval before job processing resumes after an outage of the instance API.

A related issue around maintenance mode appears to address a similar concern. This issue may be redundant with that one, but that is not fully clear.

Edited Aug 10, 2022 by Christiaan Conover