What does this MR do?
In this MR we try to avoid one-off failures of the circuitbreaker and improve its robustness:
It will now perform the stat check multiple times within the timeout. The number of attempts can be specified using the
circuitbreaker_access_retries application setting.
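
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this MR: the function name `storage_accessible`, its parameters, and the injectable `stat` argument are all hypothetical, and `retries` only stands in for the value of circuitbreaker_access_retries. Unlike a real check, this sketch cannot interrupt a single stat call that hangs; it only stops starting new attempts once the timeout has passed.

```python
import os
import time


def storage_accessible(path, timeout=1.0, retries=3, stat=os.stat):
    """Return True if any stat attempt succeeds before the deadline.

    Hypothetical sketch: `retries` plays the role of
    circuitbreaker_access_retries, `timeout` is the overall time budget.
    """
    deadline = time.monotonic() + timeout
    for _ in range(retries):
        # Stop starting new attempts once the time budget is used up.
        if time.monotonic() >= deadline:
            break
        try:
            stat(path)
            return True
        except OSError:
            # A one-off failure: try again while attempts and time remain.
            continue
    return False
```

With this shape, a single transient error no longer counts as a failure of the storage as long as a later attempt within the timeout succeeds.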
There are also 2 failure modes now:
- As soon as we reach circuitbreaker_backoff_threshold failures for a shard on a host, access to the shard on that host will be blocked for
- When we reach circuitbreaker_failure_count_threshold we will block all access until information is reset manually or after
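
As a rough illustration of how the two failure modes above relate to each other, the decision can be thought of as a comparison of the recorded failure count against the two thresholds. The sketch below is Python and purely hypothetical: the function `access_state` and the state names are invented for this example, and it assumes the backoff threshold is the lower of the two values. It does not model the waiting or reset periods.

```python
def access_state(failure_count, backoff_threshold, failure_count_threshold):
    """Classify access to a shard on a host from its failure count.

    Hypothetical sketch: the two threshold arguments stand in for
    circuitbreaker_backoff_threshold and
    circuitbreaker_failure_count_threshold.
    """
    if failure_count >= failure_count_threshold:
        # Second mode: all access is blocked until the information is reset.
        return "blocked_until_reset"
    if failure_count >= backoff_threshold:
        # First mode: access to the shard on this host is blocked temporarily.
        return "backing_off"
    return "allowed"
```

For example, with a backoff threshold of 5 and a failure count threshold of 10, 7 recorded failures would put the shard into the temporary backoff mode, while 10 would block access until a reset.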
While I was doing that, I also made private the circuit breaker methods that aren't used publicly.
Does this MR meet the acceptance criteria?
Changelog entry added, if necessary
API support added
Tests added for this feature/bug
Has been reviewed by Backend
What are the relevant issue numbers?
Closes #37383 (closed)
Closes #38231 (closed)