nfs-10 reboot
Creating the issue for the record. Timeline:
- 20:50 UTC -- PagerDuty alerted that nfs-10 is down
- 20:51 UTC -- ack, investigation started
- 20:52 UTC -- I was able to ssh in, see the load of ~600 and growing, with about 250 nfsd processes in a locked state.
- 20:55 UTC -- alert cleared, number of locked processes started to go down
Graphs:
- fleet overview: https://performance.gitlab.net/dashboard/db/fleet-overview?orgId=1&from=1514580374689&to=1514581037583
Slack discussion start: https://gitlab.slack.com/archives/C101F3796/p1514580653000183
Errors in dmesg appeared 2 minutes after it was rebooted:
```
[ 240.404065] INFO: task nfsd:1876 blocked for more than 120 seconds.
[ 240.406347] Not tainted 4.4.0-104-generic #127-Ubuntu
[ 240.408273] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
```
issue