GitLab runners don't play nice with GitLab maintenance mode
Description
After an AZ failover we ran into an interesting failure mode with the GitLab Runner code. The scenario: we had maintenance mode enabled for the failover, and traffic was re-enabled to the new AZ while maintenance mode was still on. This caused every runner in our fleet to temporarily disable itself through the health_check codepath, which is reached via requestJob calling RequestJob, where the switch case for a 403 response reports the check as unhealthy. From there it only takes common.HealthyChecks consecutive failures for a runner to disable itself, and only after common.HealthCheckInterval does it re-check and return to service, which is when we finally got back to a working CI fleet.
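For context, here is a minimal sketch of how I understand the disable logic, simplified and paraphrased rather than the actual runner implementation; the constants mirror common.HealthyChecks and common.HealthCheckInterval:

```go
package main

import (
	"fmt"
	"time"
)

// Simplified approximation of the runner health bookkeeping: every failed
// job request (e.g. a 403 while maintenance mode is on) bumps a failure
// counter, and once it reaches healthyChecks the runner stays disabled
// until healthCheckInterval has elapsed.
const (
	healthyChecks       = 3             // mirrors common.HealthyChecks
	healthCheckInterval = 1 * time.Hour // mirrors common.HealthCheckInterval
)

type runnerHealth struct {
	failures  int
	lastCheck time.Time
}

// isHealthy reports whether the runner may request jobs right now.
func (h *runnerHealth) isHealthy() bool {
	if h.failures < healthyChecks {
		return true
	}
	// Unhealthy: only re-check after the full interval has passed.
	if time.Since(h.lastCheck) > healthCheckInterval {
		h.failures = 0
		h.lastCheck = time.Now()
		return true
	}
	return false
}

// markResult is called after each job request; during maintenance mode
// every request comes back 403 and counts as unhealthy.
func (h *runnerHealth) markResult(healthy bool) {
	if healthy {
		h.failures = 0
		return
	}
	h.failures++
	h.lastCheck = time.Now()
}

func main() {
	h := &runnerHealth{}
	for i := 0; i < 4; i++ {
		h.markResult(false) // maintenance mode: every request is a 403
	}
	fmt.Println("healthy:", h.isHealthy()) // false for the next hour
}
```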
Proposal
A hardcoded 1 hour timeout seems pretty harsh, so making it configurable would be nice. That would also let one add a random offset so outages don't synchronise, avoiding the thundering-herd effect when all runners in the fleet come back at the same time.
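Roughly what I have in mind (the configurable interval and the jitter parameter are my assumptions, nothing like this exists in the runner today):

```go
package main

import (
	"math/rand"
	"time"
)

// Hypothetical: the re-check delay comes from config instead of the
// hardcoded common.HealthCheckInterval, and each runner adds a random
// jitter so the whole fleet doesn't wake up at the same instant.
func recheckDelay(configured, maxJitter time.Duration) time.Duration {
	if configured <= 0 {
		configured = time.Hour // fall back to today's hardcoded default
	}
	if maxJitter <= 0 {
		return configured
	}
	jitter := time.Duration(rand.Int63n(int64(maxJitter)))
	return configured + jitter
}

func main() {
	delay := recheckDelay(10*time.Minute, 2*time.Minute)
	_ = delay // e.g. schedule the next health re-check after `delay`
}
```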
Another idea I had is to detect maintenance mode in the response and not increment the failure counter at all?
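A rough sketch of what the 403 handling could do instead; the shape of the maintenance-mode response body here is an assumption on my part, I'd need to check what GitLab actually returns:

```go
package maintenance

import (
	"encoding/json"
	"net/http"
	"strings"
)

// Hypothetical handling of a 403 from the job request endpoint: if the
// body looks like GitLab's maintenance-mode message, report the runner
// as still healthy so the failure counter is not incremented. The body
// format is assumed; GitLab's real response may differ.
func handleForbidden(resp *http.Response) (healthy bool) {
	var body struct {
		Message string `json:"message"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err == nil &&
		strings.Contains(strings.ToLower(body.Message), "maintenance") {
		// Maintenance mode is temporary and expected, so don't count
		// it against the runner's health.
		return true
	}
	// Any other 403 (e.g. a revoked token) still counts as a failure.
	return false
}
```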
Alternatively, there could be a runner maintenance flag that temporarily disables them, but I guess that is effectively the same as pausing the runners. It would still be nice not to have to build my own automation around that.
I'm quite happy to work on implementing this myself, but wanted to get some feedback first on which direction is preferred.