nfs-10 reboot

Creating the issue for the record. Timeline:

  • 20:50 UTC -- PagerDuty alerted that nfs-10 is down
  • 20:51 UTC -- ack, investigation started
  • 20:52 UTC -- I was able to SSH in and saw a load average of ~600 and growing, with about 250 nfsd processes in a locked state.
  • 20:55 UTC -- alert cleared, number of locked processes started to go down
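
For future incidents, the check done by hand at 20:52 could be scripted. This is a minimal sketch, not what was actually run: it assumes "locked" means uninterruptible sleep (state `D` in `/proc/<pid>/stat`), which is what the hung-task messages in dmesg suggest. The helper names `parse_stat` and `count_states` are made up for illustration.

```python
import glob
from collections import Counter


def parse_stat(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so cut on the first '(' and the last ')' instead of splitting on spaces.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return comm, state


def count_states(name):
    """Count process states for every process whose comm equals `name`."""
    counts = Counter()
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as fh:
                comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-read, or unexpected format
        if comm == name:
            counts[state] += 1
    return counts


def main():
    # First field of /proc/loadavg is the 1-minute load average.
    with open("/proc/loadavg") as fh:
        print("1-minute load:", fh.read().split()[0])
    # Assumption: a high "D" count corresponds to the "locked" nfsd processes.
    print("nfsd states:", dict(count_states("nfsd")))


if __name__ == "__main__":
    main()
```

Run on the NFS host itself; a large `D` count for nfsd alongside a climbing load would match what was observed here.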

Slack discussion start: https://gitlab.slack.com/archives/C101F3796/p1514580653000183

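Hung-task errors appeared in dmesg about 4 minutes after the reboot (uptime ~240 s), which means nfsd had already been blocked for more than 120 seconds, i.e. since roughly 2 minutes after boot or earlier: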

[  240.404065] INFO: task nfsd:1876 blocked for more than 120 seconds.
[  240.406347]       Not tainted 4.4.0-104-generic #127-Ubuntu
[  240.408273] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
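
To see how many tasks were affected and when, the hung-task lines can be pulled out of dmesg programmatically. The sketch below assumes the message format shown above (kernel 4.4); the regex and the `parse_hung_tasks` name are illustrative, and other kernel versions may word the message differently.

```python
import re

# Matches lines like:
# [  240.404065] INFO: task nfsd:1876 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def parse_hung_tasks(lines):
    """Return one dict per hung-task report found in the given dmesg lines."""
    reports = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match is None:
            continue  # not a hung-task line
        reports.append({
            "uptime": float(match.group("uptime")),   # seconds since boot
            "comm": match.group("comm"),              # task name, e.g. nfsd
            "pid": int(match.group("pid")),
            "blocked_secs": int(match.group("secs")),  # detector threshold
        })
    return reports
```

For example, feeding it the output of `dmesg` split into lines and grouping the result by `comm` would show whether only nfsd was affected.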
Edited by Ilya Frolov