Circuit breakers on misconfiguration

Problem to solve

During a misconfiguration (see gitlab-ce#48338), sidekiq started but crashed within a few milliseconds. This put sidekiq into a restart loop, which in turn saturated all server CPUs:

[Screenshot: Auswahl_012]

Further details

A simple misconfiguration should never make the server unresponsive due to CPU load, especially when the misconfiguration merely causes a service to restart in a loop.

Proposal

Implement a circuit breaker that stops restarting a service once a threshold of restarts per time interval (e.g. 5 restarts within one minute) has been reached.
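
As a rough illustration of the proposed behaviour, here is a minimal sketch of a sliding-window restart limiter in Python. It is not taken from the GitLab or runit code base: the class name `RestartCircuitBreaker`, its parameters, and the injectable `clock` are assumptions made for this example only.

```python
import time
from collections import deque


class RestartCircuitBreaker:
    """Refuses further restarts once too many happen within a time window.

    Hypothetical helper for illustration; not an existing GitLab API.
    """

    def __init__(self, max_restarts=5, window_seconds=60.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._clock = clock        # injectable so the logic can be tested
        self._restarts = deque()   # timestamps of recent restarts
        self.tripped = False

    def allow_restart(self):
        """Return True and record a restart, or False if the breaker is open."""
        if self.tripped:
            return False
        now = self._clock()
        # Drop restarts that have fallen out of the sliding window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            # Threshold reached: stay open until someone resets the breaker.
            self.tripped = True
            return False
        self._restarts.append(now)
        return True
```

A supervisor would call `allow_restart()` before each restart attempt and leave the service down once it returns `False`. With the defaults above, a service that crashes every few milliseconds is restarted five times and then left stopped, while a service that crashes only once every 30 seconds never trips the breaker.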

What does success look like, and how can we measure that?

  • Add a mock for sidekiq that fails after starting
  • Start the mock
  • The mock should fail within a few milliseconds of starting
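
The test setup above could be approximated as follows. This is only a sketch under stated assumptions: `MOCK_COMMAND` is a hypothetical stand-in for sidekiq (a child Python process that exits with an error right after starting), and `supervise` is an illustrative loop, not the actual sidekiq watchdog. It counts restarts only and ignores the time window, since the mock fails so quickly that all restarts fall within the same minute.

```python
import subprocess
import sys

# Hypothetical stand-in for sidekiq: a child process that exits with an
# error immediately after it starts.
MOCK_COMMAND = [sys.executable, "-c", "import sys; sys.exit(1)"]


def supervise(command, max_restarts=5):
    """Run command, restarting it on failure at most max_restarts times.

    Returns the total number of times the command was started.
    """
    starts = 0
    while True:
        starts += 1
        exit_code = subprocess.run(command).returncode
        if exit_code == 0:
            return starts  # clean exit, nothing to restart
        if starts > max_restarts:
            # Initial start plus max_restarts restarts have all failed.
            print(
                f"giving up after {max_restarts} restarts "
                f"(last exit code {exit_code})",
                file=sys.stderr,
            )
            return starts
```

Calling `supervise(MOCK_COMMAND)` starts the mock once, restarts it five times, and then stops with an error message instead of looping forever.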

Result after implementation:

  • The sidekiq watchdog stops restarting the service after 5 restarts
  • The watchdog writes an appropriate error to the sidekiq log file

The circuit-breaker flag that prohibits restarts should be reset on manual intervention, e.g. by calling gitlab-ctl (stop|start|restart).
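
One possible way to hold such a flag is a marker file that the watchdog writes when it gives up and that the manual command path removes. The sketch below assumes this design; the `BreakerFlag` class, the file name, and the hook into gitlab-ctl are all hypothetical and only illustrate the reset semantics.

```python
import tempfile
from pathlib import Path


class BreakerFlag:
    """Persists the 'do not restart' state as a marker file (hypothetical)."""

    def __init__(self, path):
        self.path = Path(path)

    def trip(self, reason):
        # Written by the watchdog when the restart threshold is reached.
        self.path.write_text(reason + "\n")

    def is_tripped(self):
        return self.path.exists()

    def reset(self):
        # Called from the manual stop/start/restart path; safe to call twice.
        self.path.unlink(missing_ok=True)


def demo():
    """Trip the flag, then reset it as a manual intervention would."""
    with tempfile.TemporaryDirectory() as tmp:
        flag = BreakerFlag(Path(tmp) / "sidekiq.breaker")
        flag.trip("5 restarts within 60 seconds")
        tripped_before = flag.is_tripped()
        flag.reset()
        return tripped_before, flag.is_tripped()
```

Keeping the flag on disk means it survives a restart of the supervisor itself, so only an explicit operator action clears it.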

Links / references
