Excessive NFS network traffic due to TEST_STATEID requests
Two different customers have been seeing significant performance degradation over NFS after upgrading to certain versions of GitLab (https://gitlab.zendesk.com/agent/tickets/103954 on GitLab 11.2.3; https://gitlab.zendesk.com/agent/tickets/104826 and https://gitlab.zendesk.com/agent/tickets/105200 on GitLab 10.7.0, previously GitLab 9.5).
In both cases, the NFS network traffic (on port 2049) was dominated by repeated TEST_STATEID requests.
Unmounting the NFS share clears the problem.
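To check a client for this without taking a packet capture, the per-operation counters that the Linux NFS client exposes in `/proc/self/mountstats` may help. The following is a minimal sketch, assuming the usual `device ... mounted on ... with fstype ...` header lines and a `per-op statistics` section whose first number per operation is the operation count; the function names `nfs_op_counts` and `stateid_share` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed header format of each mount entry in /proc/self/mountstats.
DEVICE_LINE = re.compile(r"device \S+ mounted on (\S+) with fstype (\S+)")
# Assumed per-op line format: "<OP_NAME>: <ops> <transmissions> ...".
OP_LINE = re.compile(r"(\w+): (\d+)")


def nfs_op_counts(mountstats_text):
    """Return {mount_point: {op_name: op_count}} for NFS mounts only."""
    counts = defaultdict(dict)
    mount_point = None
    in_per_op = False
    for line in mountstats_text.splitlines():
        header = DEVICE_LINE.match(line)
        if header:
            # Only track NFS file systems (fstype nfs, nfs4, ...).
            is_nfs = header.group(2).startswith("nfs")
            mount_point = header.group(1) if is_nfs else None
            in_per_op = False
            continue
        if mount_point is None:
            continue
        stripped = line.strip()
        if stripped.startswith("per-op statistics"):
            in_per_op = True
            continue
        if in_per_op:
            op = OP_LINE.match(stripped)
            if op:
                counts[mount_point][op.group(1)] = int(op.group(2))
    return dict(counts)


def stateid_share(ops):
    """Fraction of all counted operations that were TEST_STATEID."""
    total = sum(ops.values())
    return ops.get("TEST_STATEID", 0) / total if total else 0.0
```

On an affected client, something like `nfs_op_counts(open("/proc/self/mountstats").read())` could be sampled twice a few seconds apart; a TEST_STATEID count that grows much faster than the other operations would match what the traces show.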
On one of the tickets, we see these errors:
Sep 29 03:17:29 gitlab-cf0fe9a0 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Sep 29 03:17:29 gitlab-cf0fe9a0 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Sep 29 03:17:29 gitlab-cf0fe9a0 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Sep 29 03:17:35 gitlab-cf0fe9a0 kernel: nfs4_reclaim_open_state: 1662 callbacks suppressed
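To get a sense of how often this happens on a client, the messages can be tallied from the kernel log. This is only a rough sketch that matches the two message shapes shown above; `count_reclaim_failures` is a hypothetical helper name, and the suppressed count is simply what the kernel's rate limiter reports.

```python
import re

# Patterns taken from the log lines quoted in this issue.
RECLAIM_FAILED = re.compile(
    r"kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!"
)
SUPPRESSED = re.compile(
    r"kernel: nfs4_reclaim_open_state: (\d+) callbacks suppressed"
)


def count_reclaim_failures(lines):
    """Return (logged, suppressed) counts of lock reclaim failures.

    logged     -- messages that actually appear in the log
    suppressed -- messages the kernel says it dropped due to rate limiting
    """
    logged = 0
    suppressed = 0
    for line in lines:
        if RECLAIM_FAILED.search(line):
            logged += 1
            continue
        match = SUPPRESSED.search(line)
        if match:
            suppressed += int(match.group(1))
    return logged, suppressed
```

For the four lines above this gives 3 logged and 1662 suppressed messages, so the visible lines are a small fraction of the actual failures.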
Possibly related links:
- https://access.redhat.com/solutions/1117763
- https://www.spinics.net/lists/linux-nfs/msg69416.html
- https://www.spinics.net/lists/linux-nfs/msg66582.html
- https://lists.debian.org/debian-kernel/2017/12/msg00214.html
- https://bugzilla.redhat.com/show_bug.cgi?id=1582186
- https://www.spinics.net/lists/linux-nfs/msg56688.html
- https://www.spinics.net/lists/linux-nfs/msg60753.html
- https://www.spinics.net/lists/linux-nfs/msg60760.html
Yes, typically a server reboot will cause the client to reclaim its state. If the server isn't restarting then you probably have a situation where the client and server have gotten out of sync in some fashion, the client is realizing it and attempting to reclaim its state.
One thing that could (potentially) cause this is a nfs4_unique_id collision. You might want to survey your clients and ensure that there aren't any.
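A survey like the one suggested could be sketched as follows. This assumes the hostname and `nfs4_unique_id` value have already been collected from each client (on Linux the latter is usually readable from `/sys/module/nfs/parameters/nfs4_unique_id`, though that may vary by kernel). As a simplifying assumption, two clients are treated as colliding when both values match; how the kernel actually builds its client ID string depends on the kernel version. `find_client_id_collisions` and the machine labels are hypothetical.

```python
from collections import defaultdict


def find_client_id_collisions(clients):
    """Group clients that would present the same identity to the server.

    clients: {machine_label: (hostname, nfs4_unique_id)}
    Returns {(hostname, nfs4_unique_id): [machine_label, ...]} containing
    only identities shared by more than one machine.

    Assumption: identity == (hostname, nfs4_unique_id) after trimming
    whitespace. An empty nfs4_unique_id means the parameter is unset.
    """
    by_identity = defaultdict(list)
    for label, (hostname, unique_id) in clients.items():
        identity = (hostname.strip(), unique_id.strip())
        by_identity[identity].append(label)
    return {
        identity: sorted(labels)
        for identity, labels in by_identity.items()
        if len(labels) > 1
    }
```

Cloned VMs or containers that share a hostname and leave `nfs4_unique_id` unset would show up here as one identity with several machines.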
From https://tools.ietf.org/html/rfc5661:
When delegations are revoked, the server will return with the SEQ4_STATUS_RECALLABLE_STATE_REVOKED status bit set on subsequent SEQUENCE operations. The client should note this and then use TEST_STATEID to find which delegations have been revoked.
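When reading a capture, the `sr_status_flags` field of the SEQUENCE reply can be decoded to confirm that this bit is what triggers the TEST_STATEID calls. A minimal sketch follows; the bit values are the state-revocation flags as I understand them from RFC 5661 and should be double-checked against the RFC, and the helper names are made up for illustration.

```python
# State-revocation bits of sr_status_flags (values assumed from RFC 5661;
# verify against the RFC before relying on them).
SEQ4_STATUS_FLAGS = {
    0x00000008: "SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED",
    0x00000010: "SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED",
    0x00000020: "SEQ4_STATUS_ADMIN_STATE_REVOKED",
    0x00000040: "SEQ4_STATUS_RECALLABLE_STATE_REVOKED",
}

SEQ4_STATUS_RECALLABLE_STATE_REVOKED = 0x00000040


def decode_revocation_flags(sr_status_flags):
    """Return the names of the revocation bits set in sr_status_flags."""
    return [
        name
        for bit, name in sorted(SEQ4_STATUS_FLAGS.items())
        if sr_status_flags & bit
    ]


def recallable_state_revoked(sr_status_flags):
    """True if the server reports revoked delegations (recallable state)."""
    return bool(sr_status_flags & SEQ4_STATUS_RECALLABLE_STATE_REVOKED)
```

If the server keeps returning this bit on every SEQUENCE reply, the client has a reason to keep issuing TEST_STATEID, which would be consistent with the traffic pattern described at the top.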
Indeed, I see this in the Wireshark trace:
/cc: @dblessing, @lbot, @harishsr