nfs-file01 rebooted unexpectedly
Timeline
-
10:21 - GitLab.com starts throwing 500 errors.
-
10:29 - By looking at the graphs we traced down the problem to
nfs-file01and we immediately find out that it restarted unexpectedly.
$ uptime 08:30:10 up 1 min, 1 user, load average: 0.40, 0.15, 0.06
* 10:33 - We restart the GitLab service on the web front-ends hoping it'll reset the NFS stale handlers.
* 10:35 - Seeing that nothing is improving we issue a OS reboot on the web front-ends.
* 10:44 - Seeing that the web front-ends weren't coming back online we assumed they were stuck at reboot so we power cycled them.
* 10:56 - The web front-ends are back online and the service is restored.
### What went wrong?
* We didn't get a clear alert pointing us to the root cause immediately.
* The application didn't recover automatically.
### What can do to improve?
* Create a specific alert for this kind of error.
* Create a runbook for NFS servers going down explaining how to recover the application.
* Check if we can improve the NFS mount to survive an NFS server reboot.
### Questions
* Why did `nfs-file01` reboot unexpectedly?
* Why did only web front-ends suffer from stale nfs handlers while all other front-ends continued to operate nominally?