How do we prevent outages when an NFS server reboots?
We had two outages within 24 hours related to NFS servers being rebooted; see #1633 (moved) and #1624 (moved).
We're talking to the cloud provider so that we at least learn about these issues sooner and can limit how long outages last. Beyond that, I'd like to understand what we can do, or are already doing, to prevent such events from causing outages again. My question presumably overlaps with existing efforts, but are we considering things such as:
- error message instead of outage?
- a level of HA for our NFS?
- Gitaly can be a partial solution, but not if it runs on the NFS
- Object storage can be a partial solution, but do we expect to move to fully object-based storage?
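To make the "error message instead of outage" idea a bit more concrete, here is a minimal sketch in Python, not a description of how our stack handles this today. It probes a storage path with a timeout and, after a failure, refuses further access for a cooldown period so callers fail fast instead of hanging. `StorageProbe`, `StorageUnavailable`, and the timeout and cooldown values are all hypothetical names and numbers chosen for illustration.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class StorageUnavailable(Exception):
    """Hypothetical error a worker could turn into an error page."""


class StorageProbe:
    """Fail fast when a storage path stops answering (illustrative only)."""

    def __init__(self, path, timeout=1.0, cooldown=30.0,
                 stat=os.stat, clock=time.monotonic):
        self.path = path
        self.timeout = timeout      # seconds to wait for the probe
        self.cooldown = cooldown    # seconds to skip the storage after a failure
        self._stat = stat           # injectable so the probe can be tested
        self._clock = clock
        self._open_until = 0.0      # breaker is "open" until this time
        # One worker: a probe stuck on a hung mount cannot be cancelled,
        # so we cap the number of threads that can get stuck at one.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def check(self):
        """Return None if the path answers in time, else raise StorageUnavailable."""
        if self._clock() < self._open_until:
            raise StorageUnavailable(f"{self.path}: skipped, recent failure")
        future = self._pool.submit(self._stat, self.path)
        try:
            future.result(timeout=self.timeout)
        except FutureTimeout:
            future.cancel()  # only succeeds if the probe had not started yet
            self._open_until = self._clock() + self.cooldown
            raise StorageUnavailable(f"{self.path}: no answer in {self.timeout}s")
        except OSError as exc:
            self._open_until = self._clock() + self.cooldown
            raise StorageUnavailable(f"{self.path}: {exc}") from exc
```

A worker could call `check()` before touching a repository and translate `StorageUnavailable` into an error response for requests that need that storage, while requests for other storages keep working. Note the limitation: on a hard-mounted NFS share the stuck probe thread stays blocked until the server returns, so this only protects the caller, not the mount itself.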
Concrete ideas / to-dos from the comments thread:
- what errors we were getting during that outage. For example, did all unicorns hang because they were timing out trying to access one NFS server?