Circuit breaker to avoid and monitor access to stale NFS mounts
When an NFS mount becomes inaccessible, the `exists?` cache is incorrectly set when trying to access a repository on that mount, causing the repository to appear missing, as described in https://gitlab.com/gitlab-com/infrastructure/issues/1775.
If the repository call fails because of a `Rugged::OSError`, we check whether the `repository_storage_path` is available using `Pathname#realpath`. If it is, we simply re-raise the exception, since that could just mean the repository is missing, which is acceptable.
If the check using `Pathname#realpath` raises an `Errno::EIO`, we store the time of this exception and increment a failure counter in Redis, and raise the exception wrapped in a
The exception information stored in Redis needs to be specific to a host and a storage path, because an NFS mount not responding on one host does not mean it won't respond on another host.
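A possible key scheme for this per-host, per-storage bookkeeping is sketched below. The names (`failure_key`, `record_failure`) are illustrative, and an in-memory `Hash` stands in for Redis so the sketch is self-contained; in the real system these would be Redis `INCR`/`SET` calls.

```ruby
require 'socket'

# In-memory stand-in for Redis, for illustration only.
STORE = Hash.new(0)

# Keys include the hostname so a stale mount on one host does not
# block requests served from other hosts.
def failure_key(storage_path)
  "storage_failures:#{Socket.gethostname}:#{storage_path}"
end

# Record one Errno::EIO failure: bump the counter and remember when
# the storage last failed on this host.
def record_failure(storage_path)
  key = failure_key(storage_path)
  STORE["#{key}:count"] += 1
  STORE["#{key}:last_failure"] = Time.now
end
```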
If a second request arrives but the last failure was less than 5 seconds ago, we do not retry accessing the storage, since calls to the storage take a long time when the NFS is not responding. This avoids clogging up the web workers.
If there are more than 10 subsequent failures for the same storage on the same host, the request is blocked and we raise a different exception, since something is probably wrong with the storage in that case.
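The two rules above (a 5 second backoff after a failure, and a hard block after more than 10 subsequent failures) could combine into a single decision like this sketch. The method name and constants are illustrative; only the threshold values come from the proposal.

```ruby
RETRY_BACKOFF = 5    # seconds to wait before probing the storage again
FAILURE_LIMIT = 10   # more failures than this keeps the circuit open

# Hypothetical circuit-breaker decision: returns true when the request
# should be blocked without touching the storage.
def circuit_open?(failure_count, last_failure_at, now: Time.now)
  return false if failure_count.zero?
  # Too many subsequent failures: block until an admin clears the state.
  return true if failure_count > FAILURE_LIMIT
  # Recent failure: skip the retry to avoid tying up web workers.
  (now - last_failure_at) < RETRY_BACKOFF
end
```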
There could be a view in the admin panel that shows the number of issues with a certain storage, and a button to clear the stored exceptions so the requests are executed again.
Next to that, we need to make sure the Ruby app starts when a storage is not available. It is currently being blocked in the `06_validations` initializer with the same