Runner misbehaves when GitLab is in maintenance mode

Summary

Customers have reported that when runners go offline because maintenance mode has been enabled for an extended period, they take a long time (about 1.5 hours) to come back online.

GitLab staff members can see the customer tickets at:

Steps to reproduce

The quickest way I have found to reproduce the issue without a long wait is to start the runner after maintenance mode is enabled. Otherwise, the issue occurs after an hour or so.

1. Put GitLab into maintenance mode under Admin > Settings > General > Maintenance mode.
2. Start the registered runner with gitlab-runner run.
3. After a short time, the runner outputs the error "runner <> is not healthy and will be disabled!".
4. Once this error appears, disable maintenance mode in GitLab.
5. The runner takes a long time to start picking up jobs again.
6. If you instead wait 2 hours for the runner to go offline and then disable maintenance mode, the runner remains offline for a long time because of (5).

Actual behavior

I think what's happening is that when maintenance mode is enabled, the gitlab-runner health helper triggers the error "is not healthy and will be disabled!" and marks the runner as disabled. This takes the form of a lock on the runner that resets at the runner health check interval, which is a constant set to 3600 seconds (1 hour).

So if maintenance mode is disabled shortly after the last health check, it can take up to an hour for the runners to come back online.

The issue appears to be more prevalent with docker-machine auto-scaling, possibly because new machines are created at different times between the health checks.

Expected behavior

Relevant logs and/or screenshots

Environment description

Tested with self-managed GitLab v15.1.2

Used GitLab Runner version

15.1.0

Proposal

This appears to be related to "Improve runner support for maintenance mode". That issue contains a proposal which might alter the above behavior, but I suspect that 503s will cause the same issue (I'm not sure how to reproduce this):

Given maintenance mode is a temporary server state, perhaps a 503 is a better response type that runners can interpret as a retry-able error.

However, I think a solution for this specific problem would be for GitLab to return a code that the Runner recognizes as maintenance mode, so that the runner logs the error "Server in maintenance mode" and continues with a different check that does not mark the runner as unhealthy.

Implementation Tasks

1. In maintenance mode, the GitLab instance returns a specific maintenance mode error code or message. Stretch goal: include the duration as a time value. (Code change in Rails)
2. Add an option to the Runner health_helper to check for the maintenance mode error code and time duration. If a value is set, the runner should use it to adjust the unhealthy_interval value. (Code change in gitlab-runner)

Edited May 03, 2024 by Adam Mulvany