nfs-10 reboot
Creating the issue for the record. Timeline: - 20:50 UTC -- PagerDuty alerted that nfs-10 is down - 20:51 UTC -- ack, investigation started - 20:52 UTC -- I was able to ssh in, see the load of ~600 and growing, with about 250 nfsd processes in a locked state. - 20:55 UTC -- alert cleared, number of locked processes started to go down Graphs: - fleet overview: https://performance.gitlab.net/dashboard/db/fleet-overview?orgId=1&from=1514580374689&to=1514581037583 Slack discussion start: https://gitlab.slack.com/archives/C101F3796/p1514580653000183 Errors in dmesg appeared 2 minutes after it was rebooted: ``` [ 240.404065] INFO: task nfsd:1876 blocked for more than 120 seconds. [ 240.406347] Not tainted 4.4.0-104-generic #127-Ubuntu [ 240.408273] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ```
issue