Avoid back-off for one-off failures of storage access.
Sometimes the circuitbreaker might trip because of high IO-wait on a single NFS-server.
To prevent single failures from blocking access to the storage, I propose the following:
- Introduce a `circuitbreaker_access_retries` setting: the number of times we will try to `stat` a storage within the `storage_timeout`.
- Introduce a `circuitbreaker_backoff_threshold` setting: the number of failures required before we start backing off access attempts, instead of backing off after a single failure.
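The retry idea can be sketched roughly as follows. This is only an illustration, not the actual implementation: the function name `storage_accessible`, the default values, and the way attempts are spread over the timeout are all assumptions.

```python
import os
import time


def storage_accessible(path, access_retries=3, storage_timeout=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Return True if `path` can be stat-ed within the retry and time budget.

    Hypothetical helper: the name, defaults, and pacing are assumptions.
    """
    deadline = clock() + storage_timeout
    for attempt in range(access_retries):
        try:
            os.stat(path)
            return True
        except OSError:
            pass  # a single failed stat does not count as a storage failure yet

        remaining = deadline - clock()
        if remaining <= 0:
            break  # time budget used up, give up
        if attempt < access_retries - 1:
            # Assumption: spread the attempts evenly over the timeout.
            sleep(min(remaining, storage_timeout / access_retries))
    return False
```

Only when this returns `False` would a failure be recorded. Note that a `stat` call hanging on an unresponsive NFS mount is not interrupted by this loop, so a real implementation would also need a hard timeout around each attempt.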
The different failure scenarios would then look like this:
- `failure_count > 0` and `failure_count < backoff_threshold`: Keep trying.
- `failure_count > backoff_threshold` and `failure_count < failure_count_threshold`: Block for a limited time (`failure_wait_time`) after the last failure.
- `failure_count > failure_count_threshold`: Circuitbreaker tripped, the shard is probably down, manual intervention needed.
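The scenarios above could be expressed as a small decision function. This is a sketch under assumptions: the names `AccessState` and `access_state` are made up for illustration, and since the list does not say what happens when `failure_count` is exactly equal to a threshold, the sketch treats both equal cases as backing off.

```python
from enum import Enum


class AccessState(Enum):
    ALLOW = "allow"              # keep trying
    BACKING_OFF = "backing_off"  # blocked until failure_wait_time has passed
    TRIPPED = "tripped"          # circuitbreaker tripped, manual intervention


def access_state(failure_count, seconds_since_last_failure, *,
                 backoff_threshold, failure_count_threshold,
                 failure_wait_time):
    """Classify a storage by its recorded failures (hypothetical helper)."""
    if failure_count > failure_count_threshold:
        return AccessState.TRIPPED
    # Assumption: failure_count == backoff_threshold already backs off.
    if (failure_count >= backoff_threshold
            and seconds_since_last_failure < failure_wait_time):
        return AccessState.BACKING_OFF
    return AccessState.ALLOW
```

For example, with `backoff_threshold=80`, `failure_count_threshold=160` and `failure_wait_time=30`, a storage with 100 failures is blocked for 30 seconds after its last failure and allowed again afterwards, while a storage with 161 failures stays blocked.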
Since all the properties are currently in configuration files, we can't easily adjust them to work correctly for GitLab.com. To make that easier, we should collect these properties in the database:
- failure_count_threshold: Number of failures before we stop access attempts. This value should be at least 2 times `backoff_threshold`.
- failure_wait_time: Time in seconds after an access failure before allowing access again.
- failure_reset_time: Time in seconds after which failures expire in Redis.
- storage_timeout: Time in seconds to wait before aborting a storage access attempt.
- access_retries: The number of attempts to `stat` within `storage_timeout`.
- backoff_threshold: The number of failures after which we start backing off access attempts for `failure_wait_time`. This value should be set to at least 2 times the number of workers we have (40 * 2).
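The two sizing rules in this list could be checked when the settings are saved. The following is a minimal sketch, assuming a hypothetical `CircuitbreakerSettings` container; the class, the `problems` method, and every example value except the 40 workers mentioned above are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CircuitbreakerSettings:
    """Hypothetical container for the properties listed above."""
    failure_count_threshold: int
    failure_wait_time: int    # seconds
    failure_reset_time: int   # seconds
    storage_timeout: int      # seconds
    access_retries: int
    backoff_threshold: int

    def problems(self, worker_count):
        """Return a list of violated sizing rules (empty when all hold)."""
        found = []
        if self.failure_count_threshold < 2 * self.backoff_threshold:
            found.append(
                "failure_count_threshold should be at least 2 * backoff_threshold")
        if self.backoff_threshold < 2 * worker_count:
            found.append(
                "backoff_threshold should be at least 2 * worker_count")
        return found


# Example values: only the 40 workers come from the text, the rest are made up.
settings = CircuitbreakerSettings(
    failure_count_threshold=160,
    failure_wait_time=30,
    failure_reset_time=1800,
    storage_timeout=5,
    access_retries=3,
    backoff_threshold=80,
)
```

With 40 workers, `settings.problems(worker_count=40)` returns an empty list, because `backoff_threshold` is 40 * 2 = 80 and `failure_count_threshold` is 2 * 80 = 160.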
To let us try the circuitbreaker on Canary, we will allow enabling the feature through an ENV variable.
Edited by Bob Van Landuyt