gitaly errors on nfs-11/12
It's hard to figure out the connection to the error rate, but from the two (current) samples I have:
NFS 11
Errors start to increase: https://performance.gitlab.net/dashboard/db/triage-overview?refresh=1m&panelId=17&fullscreen&orgId=1&from=1510975800000&to=1511173206000
The latency of Gitaly starts to increase until it exceeds the 60s unicorn timeout: https://performance.gitlab.net/dashboard/db/triage-overview?panelId=20&fullscreen&orgId=1&from=1510975800000&to=1511173206000
Restarting Gitaly removes the latency.
The latency degradation causes the unicorns to restart continually. Once it is removed, throughput can return to normal.
NFS 12
The same happened on nfs-12: https://performance.gitlab.net/dashboard/db/triage-overview?orgId=1&from=1511173206000&to=1511174106000
cc @gl-infra @andrewn @jacobvosmaer-gitlab
We can continue to bounce the nodes as a workaround, but that requires us to keep a very close eye on the graphs at all times.
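To avoid watching the graphs by hand, the check could in principle be automated. Below is a rough sketch only, not something we run today: it assumes we can get a list of recent Gitaly latency samples in seconds from somewhere. The function name `needs_bounce` and the parameters `warn_fraction` and `consecutive` are made up for illustration, and the defaults are guesses. The only number taken from this issue is the 60s unicorn timeout.

```python
from typing import Sequence

# The 60s unicorn timeout mentioned in this issue.
UNICORN_TIMEOUT_S = 60.0


def needs_bounce(latencies_s: Sequence[float],
                 timeout_s: float = UNICORN_TIMEOUT_S,
                 warn_fraction: float = 0.5,
                 consecutive: int = 3) -> bool:
    """Return True when the latest samples suggest the node should be bounced.

    Hypothetical helper: flags a node once the most recent `consecutive`
    latency samples are all above `warn_fraction * timeout_s`, i.e. before
    latency actually reaches the unicorn timeout.
    """
    # Not enough data yet (or a nonsensical window): do not flag.
    if consecutive <= 0 or len(latencies_s) < consecutive:
        return False
    threshold = warn_fraction * timeout_s
    # Require every sample in the window to exceed the threshold so that
    # a single spike does not trigger a restart.
    return all(sample > threshold for sample in latencies_s[-consecutive:])
```

With the default values this flags a node after three samples in a row above 30s. The thresholds would need tuning against the real latency graphs before relying on anything like this.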