Investigate what happens when the storage files are gone
@sytses made a comment that got me thinking: https://gitlab.com/gitlab-org/gitlab-ce/issues/33220#note_31403847
I'm thinking about a degraded runtime mode for when an NFS server goes down hard. The question we need to answer now is: when an NFS server goes down, what would happen if, instead of keeping the service down while we wait for it to recover, we aggressively unmounted the partition and mounted a different NFS share in its place to accept writes?
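To make the proposed swap concrete, here is a rough sketch of the two steps (force the unmount, then mount a standby share on the same path). It is only an illustration under assumptions: the function names, the mount point and the standby export are hypothetical, and `dry_run` defaults to printing the commands rather than running them. `umount -f -l` and `mount -t nfs` are the standard Linux commands; the NFS mount options we would actually need are not decided here.

```python
import shlex
import subprocess


def build_swap_commands(mount_point, standby_export):
    """Return the argv lists for swapping a dead NFS mount for a standby share."""
    return [
        # -f forces the unmount of an unreachable NFS server, -l detaches lazily
        # so processes stuck on the old mount do not block the swap.
        ["umount", "-f", "-l", mount_point],
        # Mount the standby share on the same path so the application can write again.
        ["mount", "-t", "nfs", standby_export, mount_point],
    ]


def run_swap(commands, dry_run=True):
    """Print each command and execute it only when dry_run is False."""
    rendered = []
    for argv in commands:
        line = shlex.join(argv)
        rendered.append(line)
        print(line)
        if not dry_run:
            # check=True stops the swap at the first failing command.
            subprocess.run(argv, check=True)
    return rendered


if __name__ == "__main__":
    # Hypothetical mount point and standby export, for illustration only.
    plan = build_swap_commands("/var/opt/gitlab/nfs-01", "standby-nfs:/exports/gitlab")
    run_swap(plan)
```

Keeping the plan separate from its execution means the same command list can be reviewed in a dry run before anything touches a real mount.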
I understand that in some cases reads will fail with a 404, and in other cases they will make the application crash. We need to define how well the application can survive with each of the different NFS servers gone and, if it works reasonably well, open a follow-up issue to automate the swap of NFS servers and the subsequent recovery of the files created in the meantime.
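As a way to frame what "surviving reasonably well" could mean for reads, the sketch below maps read failures to HTTP statuses instead of letting them crash the request. This is not GitLab code, just an illustration: the function names are hypothetical, and treating stale-handle and I/O errors as a 503 is an assumption about the behavior we might want, not something the application does today.

```python
import errno


def status_for_read_error(exc):
    """Map a read failure to an HTTP status; re-raise anything unexpected."""
    if isinstance(exc, FileNotFoundError):
        # The file is missing on the swapped-in share.
        return 404
    if isinstance(exc, OSError) and exc.errno in (errno.ESTALE, errno.EIO):
        # Stale NFS handle or I/O error: storage is degraded (assumed mapping).
        return 503
    raise exc


def read_or_status(path):
    """Return (status, body) for a file read without crashing on storage errors."""
    try:
        with open(path, "rb") as handle:
            return 200, handle.read()
    except OSError as exc:
        return status_for_read_error(exc), b""
```

The tests in staging would tell us which code paths already behave like the 404 branch and which ones currently crash.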
The goal here is to make the MTTR as short as possible. Less than 4 minutes would be desirable, given that our current MTTR is on the order of 40 minutes because of the time it takes a host to reboot in Azure.
We could perform these tests in Staging, maybe with more NFS servers, and crash one of them to exercise the NFS swap.