nfs-file01 rebooted unexpectedly

Timeline

  • 10:21 - GitLab.com starts throwing 500 errors.

  • 10:29 - By looking at the graphs we traced down the problem to nfs-file01 and we immediately find out that it restarted unexpectedly.

$ uptime 08:30:10 up 1 min, 1 user, load average: 0.40, 0.15, 0.06


* 10:33 - We restart the GitLab service on the web front-ends hoping it'll reset the NFS stale handlers.
* 10:35 - Seeing that nothing is improving we issue a OS reboot on the web front-ends.
* 10:44 - Seeing that the web front-ends weren't coming back online we assumed they were stuck at reboot so we power cycled them.
* 10:56 - The web front-ends are back online and the service is restored.

### What went wrong?

* We didn't get a clear alert pointing us to the root cause immediately.
* The application didn't recover automatically.

### What can do to improve?

* Create a specific alert for this kind of error.
* Create a runbook for NFS servers going down explaining how to recover the application.
* Check if we can improve the NFS mount to survive an NFS server reboot.

### Questions

* Why did `nfs-file01` reboot unexpectedly?
* Why did only web front-ends suffer from stale nfs handlers while all other front-ends continued to operate nominally?
Assignee Loading
Time tracking Loading