Runner misbehaves when GitLab is in maintenance mode
Summary
We've had customers report that when runners go offline because maintenance mode has been enabled for an extended period, they take a long time to come back online, about 1.5 hours.
GitLab staff members can see the customer tickets at:
Steps to reproduce
The best way I have found to reproduce the issue without an extensive wait is to start the runner while maintenance mode is already enabled. Otherwise the issue only occurs after an hour or so.
- Put GitLab into maintenance mode under Admin > Settings > General > Maintenance mode
- Start the registered runner with
gitlab-runner run
- After a short duration the runner will output the error "runner <> is not healthy and will be disabled!".
- Once this error appears, disable maintenance mode in GitLab
- The runner will take a long time to start picking up jobs again.
- If you instead wait 2 hours for the runner to go offline and then disable maintenance mode, it remains offline for a long time for the same reason as in step 5.
Actual behavior
I think what's happening is that when maintenance mode is enabled, the gitlab-runner health helper logs the error "is not healthy and will be disabled!" and marks the runner as disabled. This takes the form of a lock on the runner that resets at the runner health check interval, a constant set to 3600 seconds (1 hour).
So if one disables maintenance mode shortly after the last health check then it will take up to an hour for the runners to come back online.
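The lockout described above can be modelled with a small simulation. The Go sketch below is not the gitlab-runner source: the healthTracker type, the failure threshold of 3, and the one-minute polling interval are assumptions made for illustration. Only the 3600-second interval comes from the description above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	failureThreshold    = 3                   // assumed number of failures before lockout
	healthCheckInterval = 3600 * time.Second  // interval stated in this issue
	pollInterval        = time.Minute         // assumed polling period for the simulation
)

type health struct {
	failures  int
	lastCheck time.Time
}

// healthTracker is a hypothetical, simplified stand-in for the runner's health helper.
type healthTracker struct {
	mu      sync.Mutex
	runners map[string]*health
}

func newHealthTracker() *healthTracker {
	return &healthTracker{runners: make(map[string]*health)}
}

func (t *healthTracker) get(id string, now time.Time) *health {
	h, found := t.runners[id]
	if !found {
		h = &health{lastCheck: now}
		t.runners[id] = h
	}
	return h
}

// usable reports whether the runner may contact GitLab at time now.
// A locked-out runner is released only once the interval has elapsed.
func (t *healthTracker) usable(id string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	h := t.get(id, now)
	if h.failures < failureThreshold {
		return true
	}
	if now.Sub(h.lastCheck) > healthCheckInterval {
		h.failures = 0
		h.lastCheck = now
		return true
	}
	return false
}

// record stores the outcome of one request made at time now.
func (t *healthTracker) record(id string, ok bool, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	h := t.get(id, now)
	if ok {
		h.failures = 0
		h.lastCheck = now
		return
	}
	h.failures++
}

// minutesUntilResume simulates a runner started at minute 0 while maintenance
// mode lasts maintenanceMinutes, and returns how many minutes pass between the
// end of maintenance and the runner's first successful request.
func minutesUntilResume(maintenanceMinutes int) int {
	start := time.Date(2022, 7, 1, 0, 0, 0, 0, time.UTC)
	tracker := newHealthTracker()
	for m := 0; ; m++ {
		now := start.Add(time.Duration(m) * pollInterval)
		if !tracker.usable("runner-1", now) {
			continue // locked out: no request is sent at all
		}
		ok := m >= maintenanceMinutes
		tracker.record("runner-1", ok, now)
		if ok {
			return m - maintenanceMinutes
		}
	}
}

func main() {
	for _, m := range []int{2, 5, 90} {
		fmt.Printf("maintenance for %d min: runner resumes %d min after it ends\n",
			m, minutesUntilResume(m))
	}
}
```

Under these assumed values, a maintenance window of only a few minutes is enough to trip the lock, and the runner then stays idle for most of an hour after maintenance ends because nothing releases the lock early.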
The issue appears to be more pronounced with docker-machine auto-scaling, possibly because new machines are created at different times between the health checks.
Expected behavior
Relevant logs and/or screenshots
Environment description
Tested with self-managed GitLab v15.1.2
Used GitLab Runner version
15.1.0
Proposal
This appears to be related to "Improve runner support for maintenance mode". That issue contains a proposal which might alter the above behavior, but I suspect that 503s will cause the same issue (I'm not sure how to reproduce this):
"Given maintenance mode is a temporary server state, perhaps a 503 is a better response type that runners can interpret as a retry-able error."
However, I think a solution for this specific problem would be for GitLab to return a code that the Runner recognizes as maintenance mode, so that the runner logs the error "Server in maintenance mode" and continues with a different check that does not mark the runner as unhealthy.
Implementation Tasks
- In maintenance mode the GitLab instance returns a specific maintenance mode error code or message. Stretch goal: include the duration as a time value. (Code change in Rails)
- Add an option to the Runner health_helper to check for the maintenance mode error code and time duration. If the value is set, the runner should use it to adjust the unhealthy_interval value. (Code change in gitlab-runner)
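A rough sketch of how the runner side of these tasks might look, in Go. Everything specific here is an assumption for illustration: the "maintenance_mode" marker, the idea of passing the duration as a number of seconds, the 30-second fallback, and the interpretResponse and pollResult names. Neither GitLab nor gitlab-runner is known to implement any of this today.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// maintenanceCode is a hypothetical marker the server would send in maintenance mode.
const maintenanceCode = "maintenance_mode"

// defaultMaintenanceWait is an assumed fallback when the server gives no duration.
const defaultMaintenanceWait = 30 * time.Second

// pollResult describes how the runner should treat one response from GitLab.
type pollResult struct {
	countsAsFailure bool          // whether to increment the health failure counter
	retryIn         time.Duration // suggested wait before the next request
	message         string        // what the runner would log
}

// interpretResponse classifies a response. errorCode and retrySeconds stand in
// for whatever fields the Rails change would eventually expose.
func interpretResponse(status int, errorCode, retrySeconds string) pollResult {
	switch {
	case status >= 200 && status < 300:
		return pollResult{message: "ok"}
	case status == http.StatusServiceUnavailable && errorCode == maintenanceCode:
		wait := defaultMaintenanceWait
		if secs, err := strconv.Atoi(retrySeconds); err == nil && secs > 0 {
			wait = time.Duration(secs) * time.Second
		}
		// Maintenance mode is reported but not counted against runner health.
		return pollResult{retryIn: wait, message: "Server in maintenance mode"}
	default:
		// Any other error, including a plain 503, still counts as a failure.
		return pollResult{countsAsFailure: true, message: "request failed"}
	}
}

func main() {
	fmt.Printf("%+v\n", interpretResponse(503, maintenanceCode, "300"))
	fmt.Printf("%+v\n", interpretResponse(503, "", ""))
}
```

The point of the design is that a recognized maintenance response never increments the failure counter, so the one-hour lock is not triggered, while the server-supplied duration replaces the fixed interval as the wait before the next attempt.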