Circuit breakers on misconfiguration
Problem to solve
During a misconfiguration (see gitlab-ce#48338), Sidekiq started but crashed within a few milliseconds. This put Sidekiq into a restart loop, which in turn saturated all server CPUs.
Further details
A simple misconfiguration should never make the server unresponsive due to CPU load, especially when the misconfiguration causes a service to enter a restart loop.
Proposal
Implement a circuit breaker that stops restarting a service once a threshold of restarts per time interval (e.g. 5 restarts within a minute) has been reached.
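The threshold check could be sketched as a sliding window over recent restart timestamps. This is only an illustration, not the existing watchdog code: the class name RestartCircuitBreaker, its parameters, and the injected `now` timestamp are assumptions made for the example, and keeping the breaker tripped until an operator intervenes is left out here.

```python
from collections import deque


class RestartCircuitBreaker:
    """Hypothetical sketch: refuses restarts once `max_restarts`
    have happened within the last `window_seconds`."""

    def __init__(self, max_restarts=5, window_seconds=60.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps of recent restarts, oldest first

    def allow_restart(self, now):
        """Record the restart and return True, or return False
        if the threshold for the current window is already reached."""
        # Drop restarts that have aged out of the sliding window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return False
        self._restarts.append(now)
        return True
```

A caller would pass a monotonic clock reading (for example time.monotonic()) as `now` before each restart attempt and skip the restart when the method returns False.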
What does success look like, and how can we measure that?
- Add a mock for Sidekiq that fails after starting
- Start the mock
- The mock should fail within a few milliseconds of starting
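One way to build such a mock, as a rough sketch: launch a child process that exits with an error as soon as it starts. The function name run_mock_sidekiq and the error message are made up for this example. A Python child process takes longer than a few milliseconds to start, so the elapsed time is returned for inspection rather than asserted.

```python
import subprocess
import sys
import time


def run_mock_sidekiq():
    """Hypothetical mock: start a process that fails right after starting.

    Returns (exit_code, stderr_text, elapsed_seconds).
    """
    started = time.monotonic()
    # The child stands in for a misconfigured Sidekiq: it starts, then exits
    # with a non-zero status and an error message on stderr.
    result = subprocess.run(
        [sys.executable, "-c", "import sys; sys.exit('mock sidekiq: bad configuration')"],
        capture_output=True,
        text=True,
    )
    elapsed = time.monotonic() - started
    return result.returncode, result.stderr.strip(), elapsed
```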
Result after implementation:
- The Sidekiq watchdog does not restart the service again after 5 restarts, and
- it writes an appropriate error to the Sidekiq log file
The circuit-breaker flag that prohibits a restart should be reset after manual intervention, e.g. by calling gitlab-ctl (stop|start|restart).
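The expected behaviour can be simulated without a real service. The sketch below is an assumption-laden illustration: the Watchdog class, the logger name, and the manual_restart method (standing in for an operator running gitlab-ctl) are hypothetical, and the per-minute time window is omitted to keep the flag lifecycle in focus.

```python
import logging

log = logging.getLogger("sidekiq-watchdog")  # hypothetical logger name


class Watchdog:
    """Hypothetical sketch of a watchdog with a latching circuit-breaker flag."""

    def __init__(self, start_service, max_restarts=5):
        self.start_service = start_service  # callable; returns True if the service stays up
        self.max_restarts = max_restarts
        self.restarts = 0
        self.tripped = False  # the flag that prohibits further restarts

    def supervise(self):
        """Restart a failing service until it stays up or the breaker trips."""
        if self.tripped:
            return False
        while not self.start_service():
            if self.restarts >= self.max_restarts:
                self.tripped = True
                log.error("circuit breaker tripped after %d restarts; giving up",
                          self.restarts)
                return False
            self.restarts += 1
        return True

    def manual_restart(self):
        """Stands in for manual intervention: clears the flag and tries again."""
        self.tripped = False
        self.restarts = 0
        return self.supervise()
```

With a service that always fails, supervise() makes one initial start plus five restarts, then sets the flag and logs an error. Further calls do nothing until manual_restart() clears the flag.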
