Yes, typically a server reboot will cause the client to reclaim its
state. If the server isn't restarting, then you probably have a situation
where the client and server have gotten out of sync in some fashion, and the
client is noticing this and attempting to reclaim its state.
One thing that could potentially cause this is an nfs4_unique_id
collision. You might want to survey your clients and ensure that there
aren't any duplicates.
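As a rough sketch of that survey (assuming the standard nfs kernel module parameter; the hostnames are placeholders), you could compare the value across clients:
# Print each client's nfs4_unique_id; identical non-empty values indicate a collision.
# An empty value means the default (hostname-derived) identity is in use.
for host in node-1 node-2 node-3; do
  echo -n "$host: "
  ssh "$host" cat /sys/module/nfs/parameters/nfs4_unique_id
done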
When delegations are revoked, the server will set the
SEQ4_STATUS_RECALLABLE_STATE_REVOKED status bit in subsequent
SEQUENCE replies. The client should note this and then use
TEST_STATEID to find which delegations have been revoked.
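For reference, one way to eyeball this in a capture is with tshark. This is only a sketch: nfs_trace.pcap is a placeholder file name, and the nfs.opcode display filter is an assumption about the Wireshark NFSv4 dissector that may differ between versions.
# Count TEST_STATEID operations (NFSv4.1 opcode 55) in the capture.
tshark -r nfs_trace.pcap -Y 'nfs.opcode == 55' | wc -l
# Dump SEQUENCE (opcode 53) replies verbosely and look for the revoked-state flag.
tshark -r nfs_trace.pcap -Y 'nfs.opcode == 53' -V | grep -i recallable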
I spent a few hours this morning looking over the packet captures and researching this further, and so far I can only confirm that the captures show the behavior you described. There isn't any other significant information in them beyond the number of tests, the lock IDs, and the NFS responses. The test_stateid commands are all for the same lock ID, and the server is universally responding with NFS4_OK.
We can rule out a few issues that we turned up in our searches based on the consistent NFS4_OK returns. We can also rule out issues like networking and routing, because the sequence numbers line up without any gaps.
We can rule out one of the GitHub commits you linked to: https://github.com/torvalds/linux/commit/994b15b983a72e1148a173b61e5b279219bb45ae. That commit describes a loop of "resending the IO", which is different from a loop of sending test_stateid. We can also rule out the issue in the RHEL page I linked, because all the tests are coming back NFS4_OK.
I searched our internal tools for any other customer who had seen this issue, but I couldn't find any. There is a substantial case history in our support tooling, and we have an officially supported NFS 4.1-based product (EFS), which makes it notable that the only results I found for test_stateid were other NFS cases in which reports enumerated the different types of NFS calls in various TCP dumps. Across roughly 20 of these enumerations, the quantity was something like 4-10 total test_stateid commands out of 10,000+ NFS commands. I turned up these cases simply because they contained the string "test_stateid"; the issues themselves were varied and not specifically about test_stateid.
I found one report on the internet that issuing multiple locks for the same file can cause this behavior after the state of the locks is lost. For example, if a client is rebooted, it must send test_stateid commands to regain the state of its former locks. If 1,000 locks were issued for the same file, it will send out 1,000 tests, one per lock:
https://www.spinics.net/lists/linux-nfs/msg52916.html
Based on all of the above, I suspect that there is something specific about your setup and the GitLab software that is causing multiple locks to be retained for the same file, or that it intersects with a kernel/NFS bug that hasn't been documented by AWS support before.
One thing I'm working through in my mind is why the loop wouldn't be interrupted by another client asking for permission to use the specific file whose state is being checked. I see plainly in the TCP dump that the same lock is being tested for every test_stateid command. If node 1 is going through a loop of test_stateid -> NFS4_OK, it should eventually get a different response which tells node 1 that node 2 needs to use the file. That might cause node 1 to stop trying to test its lock and re-acquire it instead. We could test this trivially if we knew which file was causing the issue and simply read it from another node.
If the many-locks/one-file hypothesis is correct, we should expect that opening the file for writing would not break the loop, because the client is sequentially working through a test for each and every lock. Conversely, we should expect that the "loop" would eventually end, because there should be a finite, perhaps very large, number of locks to test. If the cluster is totally unable to move forward because it's waiting for files to be available, I personally suspect that the many-locks/one-file hypothesis is most likely.
In the above link a user describes manually triggering their many-locks/one-file issue by rebooting an instance, but a network change or interruption should be capable of forcing the client to recheck all of its states as well. Is there any type of similar event leading up to the issue? Are ECS nodes being added when it takes place? Is there a threshold rate of test_stateid commands at which you notice the issue? Perhaps we could measure the rate of the commands at various intervals after the mount is remounted to see if it is slowly increasing over time.
Is there a way we could dump out a listing of the currently delegated file locks from NFS? Could we script the client to release its delegated locks when the issue is detected?
One idea I have is that it might be possible to probe the number of delegated locks for a single file over time by remounting the NFS filesystem, waiting a known period of time, and then causing a networking interruption. Imagine running 'ifdown eth0 && sleep 10; ifup eth0' with a TCP dump started during the 10 seconds while the interface is down. When the interface comes back up, the resulting TCP dump should contain one test_stateid per delegated lock. Using this, perhaps we could build a body of evidence that the number of locks on a particular file is increasing over time, eventually causing this cascade of test_stateid traffic when they need to be checked sequentially.
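A minimal sketch of that probe, assuming the mount uses eth0 and that the capture survives the interface bounce on this kernel (/tmp/nfs_probe.pcap is a placeholder path):
# Start the capture first so the post-ifup state re-checks are not missed; 2049 is the NFS port.
tcpdump -i eth0 -w /tmp/nfs_probe.pcap port 2049 &
TCPDUMP_PID=$!
sleep 2
ifdown eth0 && sleep 10; ifup eth0
sleep 60    # give the client time to re-test its stateids
kill "$TCPDUMP_PID"
# Roughly one TEST_STATEID per delegated lock should appear in the capture.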
Did GitLab confirm that it would be okay to test with delegation disabled? If a small hit in general performance mitigates an issue which causes a complete stoppage of the cluster, then it should be well worth it to disable delegation. If an upcoming kernel patch fixes the issue, or if we can reproduce the issue in a test environment, this strategy could bridge the gap until resolution. I checked briefly on the performance impact of delegation by comparing NFS 3 to NFS 4 benchmarks, and it seems to be just a few percent difference in total performance.
The response from AWS is helpful. Some of my comments:
> One thing I'm working through in my mind is why the loop wouldn't be interrupted by another client asking for permission to use the specific file whose state is being checked. I see plainly in the TCP dump that the same lock is being tested for every test_stateid command. If node 1 is going through a loop of test_stateid -> NFS4_OK, it should eventually get a different response which tells node 1 that node 2 needs to use the file. That might cause node 1 to stop trying to test its lock and re-acquire it instead. We could test this trivially if we knew which file was causing the issue and simply read it from another node.
Right, I had similar thoughts in looking at the network capture. I see hundreds of NFS OPEN calls for the same file handle. Usually I can tell what filename is being used from the trace if the packet capture has enough context, but in this case we don't have that.
> Is there a way we could dump out a listing of the currently delegated file locks from NFS? Could we script the client to release its delegated locks when the issue is detected?
lsof -N shows the open NFS file handles. We should definitely run this when the problem starts.
We might also run strace on the gitaly or gitaly-ruby processes to see if there are clues that suggest which files are being opened.
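A rough sketch of both checks; the pgrep lookup assumes the process is literally named gitaly, so adjust it to the actual process names:
# List open NFS files on this client.
lsof -N
# Attach to a running gitaly process and watch which paths it opens.
strace -f -e trace=open,openat -p "$(pgrep -o gitaly)"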
> Did GitLab confirm that it would be okay to test with delegation disabled? If a small hit in general performance mitigates an issue which causes a complete stoppage of the cluster, then it should be well worth it to disable delegation. If an upcoming kernel patch fixes the issue, or if we can reproduce the issue in a test environment, this strategy could bridge the gap until resolution. I checked briefly on the performance impact of delegation by comparing NFS 3 to NFS 4 benchmarks, and it seems to be just a few percent difference in total performance.
I think we said it should be okay, and that you went ahead with it. It's possible we still need to remount the shares on the clients for them to get the message.
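For reference, on a self-managed Linux NFS server (not a managed service like EFS, where this is controlled by the provider), delegations are typically disabled by turning off leases before the NFS server starts; a rough sketch, with an illustrative sysctl.d file name:
# On the NFS server, before nfsd starts (restart the NFS service for it to take effect).
echo 0 > /proc/sys/fs/leases-enable
# Persist the setting across reboots.
echo "fs.leases-enable = 0" >> /etc/sysctl.d/90-nfs-no-delegation.conf
sysctl --system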
We saw this happen today with the second customer, running Red Hat Enterprise Linux 7.3 (kernel 3.10.0-693.21.1). When we first arrived on the call, we noticed iptables blocking all NFS traffic. We flushed those rules via iptables -F INPUT, and then NFS started working again. But then the load got really high, and the network traffic showed repeated TEST_STATEID events for the same ID. We turned on kernel NFS debugging via rpcdebug -m nfs -s all, and /var/log/messages filled with lots of repeated messages:
I noticed that in all the network traces I have, the SEQ4_STATUS_RECALLABLE_STATE_REVOKED bit is set on the SEQUENCE replies. From https://tools.ietf.org/html/rfc5661:
SEQ4_STATUS_RECALLABLE_STATE_REVOKED
When set, indicates that one or more recallable objects have been
revoked without expiration of the lease period, due to the
client's failure to return them when recalled, which may be a
consequence of there being no working backchannel and the client
failing to re-establish a backchannel per the
SEQ4_STATUS_CB_PATH_DOWN, SEQ4_STATUS_CB_PATH_DOWN_SESSION, or
SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED status flags. This status bit
remains set on all SEQUENCE replies until the loss of all such
locks has been acknowledged by use of FREE_STATEID.
I've found this extremely useful on clients in tracking down 'lost' delegations.
echo "error != 0" | tee /sys/kernel/debug/tracing/events/nfs4/nfs4_delegreturn_exit/filter
...and then look in here:
cat /sys/kernel/debug/tracing/trace
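Note that setting the filter alone may not record anything; the tracepoint usually also has to be enabled. A sketch, assuming debugfs is mounted in the usual place:
# Enable the nfs4_delegreturn_exit tracepoint so the filtered events are recorded.
echo 1 | tee /sys/kernel/debug/tracing/events/nfs4/nfs4_delegreturn_exit/enable
# Follow events as they arrive instead of re-reading the whole buffer.
cat /sys/kernel/debug/tracing/trace_pipe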
It may be that we are tripping this bug more frequently because we leave a lot of files open. For example, on file-19, I see 35,116 open files, 200 of which have been deleted:
Here you can see the delegreturn counter is at 2112. In this example, that number should stay at 2112 if delegations are disabled; if delegations are enabled, the delegreturn number should increase.
A quick way to test this is to do the following (a scripted version of the same check follows the list):
1. On Node A, run nfsstat -c. Record the output.
2. On Node A, write a file to the NFS mount (e.g. echo hello > /var/opt/gitlab/git-data/test.txt).
3. On Node B, remove that file (e.g. rm /var/opt/gitlab/git-data/test.txt). It's important to remove the file on another node, since this will cause the NFS server to tell Node A to release the delegation.
4. On Node A, run nfsstat -c again and compare the values.
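A minimal sketch of the same check, with placeholder paths for the saved output:
# On Node A: record the baseline client counters, then create the file.
nfsstat -c > /tmp/nfsstat_before.txt
echo hello > /var/opt/gitlab/git-data/test.txt
# On Node B: remove the file so the server recalls Node A's delegation.
rm /var/opt/gitlab/git-data/test.txt
# Back on Node A: with delegations enabled, the delegreturn counter should have increased.
nfsstat -c > /tmp/nfsstat_after.txt
diff /tmp/nfsstat_before.txt /tmp/nfsstat_after.txt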
GitLab also ships with node_exporter, which records these metrics in the Prometheus time-series database. The name of the Prometheus metric for this is:
Bug 1552203 - RHEL7.4 NFS4.1 client connected to Linux NFS server sends repeated SEQUENCE / TEST_STATEIDs to the NFS server and receives NFS4_OK but SEQUENCE Reply has SEQ4_STATUS_RECALLABLE_STATE_REVOKED set
It appears likely that this commit, yet to be backported to RHEL 7, may explain this issue:

commit 95da1b3a5aded124dd1bda1e3cdb876184813140
Author: Andrew Elble <aweits@rit.edu>
Date:   Fri Nov 3 14:06:31 2017 -0400

    nfsd: deal with revoked delegations appropriately

    If a delegation has been revoked by the server, operations using that
    delegation should error out with NFS4ERR_DELEG_REVOKED in the >4.1
    case, and NFS4ERR_BAD_STATEID otherwise.

    The server needs NFSv4.1 clients to explicitly free revoked delegations.
    If the server returns NFS4ERR_DELEG_REVOKED, the client will do that;
    otherwise it may just forget about the delegation and be unable to
    recover when it later sees SEQ4_STATUS_RECALLABLE_STATE_REVOKED set on
    a SEQUENCE reply. That can cause the Linux 4.1 client to loop in its
    stage manager.

    Signed-off-by: Andrew Elble <aweits@rit.edu>
    Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields <bfields@redhat.com>

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index ecbc7b0..b8281776 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -4016,7 +4016,8 @@ static struct nfs4_delegation *find_deleg_stateid(struct nfs4_client *cl, statei
 {
     struct nfs4_stid *ret;

-    ret = find_stateid_by_type(cl, s, NFS4_DELEG_STID);
+    ret = find_stateid_by_type(cl, s,
+                               NFS4_DELEG_STID|NFS4_REVOKED_DELEG_STID);
     if (!ret)
         return NULL;
     return delegstateid(ret);
@@ -4039,6 +4040,12 @@ static bool nfsd4_is_deleg_cur(struct nfsd4_open *open)
     deleg = find_deleg_stateid(cl, &open->op_delegate_stateid);
     if (deleg == NULL)
         goto out;
+    if (deleg->dl_stid.sc_type == NFS4_REVOKED_DELEG_STID) {
+        nfs4_put_stid(&deleg->dl_stid);
+        if (cl->cl_minorversion)
+            status = nfserr_deleg_revoked;
+        goto out;
+    }
     flags = share_access_to_flags(open->op_share_access);
     status = nfs4_check_delegmode(deleg, flags);
     if (status) {
@@ -4908,6 +4915,16 @@ static __be32 nfsd4_validate_stateid(struct nfs4_client *cl, stateid_t *stateid)
              struct nfs4_stid **s, struct nfsd_net *nn)
 {
     __be32 status;
+    bool return_revoked = false;
+
+    /*
+     * only return revoked delegations if explicitly asked.
+     * otherwise we report revoked or bad_stateid status.
+     */
+    if (typemask & NFS4_REVOKED_DELEG_STID)
+        return_revoked = true;
+    else if (typemask & NFS4_DELEG_STID)
+        typemask |= NFS4_REVOKED_DELEG_STID;

     if (ZERO_STATEID(stateid) || ONE_STATEID(stateid))
         return nfserr_bad_stateid;
@@ -4922,6 +4939,12 @@ static __be32 nfsd4_validate_stateid(struct nfs4_client *cl, stateid_t *stateid)
     *s = find_stateid_by_type(cstate->clp, stateid, typemask);
     if (!*s)
         return nfserr_bad_stateid;
+    if (((*s)->sc_type == NFS4_REVOKED_DELEG_STID) && !return_revoked) {
+        nfs4_put_stid(*s);
+        if (cstate->minorversion)
+            return nfserr_deleg_revoked;
+        return nfserr_bad_stateid;
+    }
     return nfs_ok;
 }
Follow-up comment:
I am marking for RHEL7.7 for now and we can do backports to z-streams as needed or even reconsider for RHEL7.6 if an urgent need is demonstrated and engineering agrees. Probably too late for RHEL7.6 but possibly not out of the question due to stable CC.
Running "./nfstest_delegation --runtest recall26" uncovers that
client doesn't recover the lock when we have an appending open,
where the initial open got a write delegation.
I wonder if CI traces are playing a part here: if lots of appends are happening from runners, we may be triggering this. I'm not sure if the upgrade from 9.5 to 10.7 introduced this, but it may be worthwhile to test if we can get an NFS client to fail by doing lots of appends with a CI runner bouncing between nodes.
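A hypothetical reproduction sketch along those lines, with a placeholder path on the shared mount; run each loop on a different NFS client node:
# Node A: append repeatedly to a file on the NFS share.
for i in $(seq 1 1000); do echo "node-a $i" >> /var/opt/gitlab/git-data/append-test.log; done
# Node B: append to the same file so delegations get recalled back and forth.
for i in $(seq 1 1000); do echo "node-b $i" >> /var/opt/gitlab/git-data/append-test.log; done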
Hello, @stanhu. We recently upgraded our OS from RHEL 6.x to RHEL 7.6, and we have been facing this issue ever since. Whenever it happens, CPU consumption soars, and we keep seeing the errors below in /var/log/messages:
This causes problems for GitLab, as it is unable to process any requests and users keep seeing 502s. Is there a fix for this? Any suggestions on what can be done to prevent it, or any remedial measures?
Thank you. Posting some information that might help others. If you cannot disable NFS server delegation for some reason, the following workaround worked for us:
Stop the GitLab container (this kills all the GitLab-specific processes running on the server).
BZ - 1552203 - RHEL7.4 NFS4.1 client and server repeated SEQUENCE / TEST_STATEIDs with SEQUENCE Reply has SEQ4_STATUS_RECALLABLE_STATE_REVOKED set - NFS server should return NFS4ERR_DELEG_REVOKED or NFS4ERR_BAD_STATEID for revoked delegations