Avoid back-off for one-off failures of storage access.

Sometimes the circuitbreaker might trip because of high IO-wait on a single NFS-server.

To prevent a single failure from blocking access to the storage, I propose the following:

Introduce a circuitbreaker_access_retries setting: This is the number of times we will try to stat a storage within the storage_timeout.
Introduce a circuitbreaker_backoff_threshold setting: This is the number of failures required before we start backing off the access attempts, instead of starting to back off after one failure.
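
One possible reading of the access_retries setting is sketched below in Python: the check retries a stat call up to access_retries times, but never beyond the overall storage_timeout budget. The function name storage_accessible, the default values, and the injectable stat argument are assumptions made for illustration and are not taken from the actual implementation.

```python
import os
import time


def storage_accessible(path, storage_timeout=5.0, access_retries=3, stat=os.stat):
    """Return True if `path` could be stat-ed within the time budget.

    Hypothetical helper: the name, defaults, and `stat` parameter are
    illustrative assumptions, not the real implementation.
    """
    deadline = time.monotonic() + storage_timeout
    for _ in range(access_retries):
        # Stop retrying once the overall budget is used up.
        if time.monotonic() >= deadline:
            break
        try:
            stat(path)
            return True
        except OSError:
            # A one-off failure (e.g. high IO-wait): try again.
            continue
    return False
```

Note that this sketch only checks the deadline between attempts. A stat call that hangs on an unresponsive NFS mount would not be interrupted, so a real implementation would need to run each attempt in a way that can be aborted.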

The different failure scenarios would then look like this:

failure_count > 0 and failure_count < backoff_threshold: Keep trying
failure_count > backoff_threshold and failure_count < failure_count_threshold: Block for a limited time (failure_wait_time) after the last failure
failure_count > failure_count_threshold: Circuitbreaker tripped, the shard is probably down, manual intervention needed.
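
The three scenarios can be sketched as a small Python function. The function name access_state and the returned labels are hypothetical. The scenarios above do not say what happens when failure_count is exactly equal to a threshold, so this sketch assumes that reaching a threshold counts as crossing it.

```python
def access_state(failure_count, backoff_threshold, failure_count_threshold,
                 seconds_since_last_failure, failure_wait_time):
    """Classify a storage into one of the proposed failure scenarios.

    Assumption: reaching a threshold (==) is treated the same as exceeding it.
    """
    if failure_count >= failure_count_threshold:
        # Circuitbreaker tripped: manual intervention needed.
        return "tripped"
    if failure_count >= backoff_threshold:
        # Block only for failure_wait_time after the last failure.
        if seconds_since_last_failure < failure_wait_time:
            return "backing_off"
        return "retry_allowed"
    # Fewer failures than backoff_threshold: keep trying.
    return "keep_trying"
```

For example, with backoff_threshold = 80 and failure_count_threshold = 160, a storage with 100 failures is blocked while the last failure is less than failure_wait_time seconds old, and may be tried again afterwards.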

Since all the properties are currently in configuration files, we can't easily adjust them to work correctly for GitLab.com. To make that easier, we should collect these properties in the database:

failure_count_threshold: Number of failures before stopping attempts. This value should be at least 2 times backoff_threshold.
failure_wait_time: Seconds after an access failure before allowing access again
failure_reset_time: Time in seconds to expire failures in Redis.
storage_timeout: Time in seconds to wait before aborting a storage access attempt
access_retries: The number of attempts to stat within storage_timeout
backoff_threshold: The number of failures after which we start backing off access attempts for failure_wait_time. This value should be set to at least 2 times the number of workers we have (40 * 2).
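
The constraints between these properties could be checked when the settings are saved. The following Python sketch is only an illustration: the class name CircuitbreakerSettings and the validate method are hypothetical, and the time and retry values in the usage example are placeholders. Only the worker count of 40 and the two "at least 2 times" rules come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CircuitbreakerSettings:
    """Hypothetical container for the properties listed above."""
    failure_count_threshold: int
    failure_wait_time: int    # seconds
    failure_reset_time: int   # seconds
    storage_timeout: int      # seconds
    access_retries: int
    backoff_threshold: int

    def validate(self, worker_count):
        """Return a list of messages for violated constraints (empty if valid)."""
        errors = []
        if self.backoff_threshold < 2 * worker_count:
            errors.append("backoff_threshold should be at least 2 times the number of workers")
        if self.failure_count_threshold < 2 * self.backoff_threshold:
            errors.append("failure_count_threshold should be at least 2 times backoff_threshold")
        return errors


# Placeholder values, except for the 40 workers mentioned above.
settings = CircuitbreakerSettings(
    failure_count_threshold=160,
    failure_wait_time=30,
    failure_reset_time=1800,
    storage_timeout=5,
    access_retries=3,
    backoff_threshold=80,
)
problems = settings.validate(worker_count=40)  # empty list: both rules hold
```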

To allow trying the circuitbreaker on Canary, we will make it possible to enable the feature using an environment variable.
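
A minimal Python sketch of such a toggle follows. The variable name CIRCUITBREAKER_ENABLED and the accepted values are assumptions for illustration; the proposal does not name the variable.

```python
import os


def circuitbreaker_enabled(environ=os.environ):
    """Return True if the (hypothetically named) toggle is set to a truthy value."""
    value = environ.get("CIRCUITBREAKER_ENABLED", "")
    return value.strip().lower() in {"1", "true", "yes"}
```

Passing the environment as an argument keeps the check easy to test without changing the real process environment.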

Edited Oct 16, 2017 by Bob Van Landuyt