Repository updates intermittently disappear on busy repos hosted on an NFS mount shared by multiple Gitaly servers
The issue is occurring for a large customer with three Gitaly nodes sharing an NFS mount behind a load balancer, so that they appear to Rails as a single storage.
After upgrading to v13.3.2 several weeks ago, users on several high-traffic projects on their instance have reported recently merged changes disappearing. No errors are received when merging, but when they check the updated file later, the change is not present. They had no reports of this issue while on v12.10.
This may be the same underlying issue as #2589 (closed).
Looking at a specific example, we found that a GC had been running on a separate Gitaly host while a UserMergeBranch was being executed. The GC reported the following error:
{
"correlation_id": "GPhidQI0kw9",
"grpc.meta.auth_version": "v2",
"grpc.meta.client_name": "gitlab-sidekiq",
"grpc.meta.deadline_type": "unknown",
"grpc.method": "GarbageCollect",
"grpc.request.deadline": "2020-09-29T12:49:23Z",
"grpc.request.fullMethod": "/gitaly.RepositoryService/GarbageCollect",
"grpc.request.glProjectPath": "services/crm/Lightning/lightning-service/service",
"grpc.request.glRepository": "project-3515",
"grpc.request.repoPath": "services/crm/Lightning/lightning-service/service.git",
"grpc.request.repoStorage": "default",
"grpc.request.topLevelGroup": "services",
"grpc.service": "gitaly.RepositoryService",
"grpc.start_time": "2020-09-29T06:49:23Z",
"level": "error",
"msg": "warning: garbage found: /var/opt/gitlab-data/git-data/repositories/group/project/objects/pack/.nfs0000000002e6ff1b0002d23a
warning: garbage found: /var/opt/gitlab-data/git-data/repositories/group/project.git/objects/pack/.nfs0000000002e6a8590002d236
warning: garbage found: /var/opt/gitlab-data/git-data/repositories/group/project.git/objects/pack/.nfs0000000002e74aa30002d237
warning: garbage found: /var/opt/gitlab-data/git-data/repositories/group/project.git/objects/pack/.nfs0000000002b239f50002d238
warning: garbage found: /var/opt/gitlab-data/git-data/repositories/group/project.git/objects/pack/.nfs0000000002bf55ea0002d234",
"peer.address": "10.143.238.127:20084",
"pid": 16719,
"span.kind": "server",
"system": "grpc",
"time": "2020-09-29T06:50:44.402Z"
}
These .nfs... files indicate that a file with an open file handle against it was deleted. NFS will perform a silly rename when this occurs, renaming the file to the .nfs... format and keeping it on disk until the file handle is released. This suggests that something is deleting pack files out from under another process, though it is unclear which specific processes are involved.
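To check whether other repositories are affected, one option is to look for silly-renamed files in a repository's objects/pack directory. The following is a minimal sketch, not part of Gitaly; the function name `find_silly_renames` is hypothetical, and it assumes the repository path is readable from the node where the script runs.

```python
from pathlib import Path


def find_silly_renames(repo_path):
    """Return the .nfs* entries found in <repo_path>/objects/pack.

    Such entries are what an NFS client leaves behind when a file is
    deleted while a process on that client still holds it open.
    """
    pack_dir = Path(repo_path) / "objects" / "pack"
    if not pack_dir.is_dir():
        # Not a Git repository layout, or the path is wrong: nothing to report.
        return []
    return sorted(p for p in pack_dir.iterdir() if p.name.startswith(".nfs"))
```

Calling `find_silly_renames` with the path of a bare repository on the shared mount returns a sorted list of `Path` objects. A non-empty result only shows that a deletion raced with an open handle; it does not identify which process deleted the file.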
In addition, we see intermittent rpc error: code = Unknown desc = Rugged::OSError: failed to read descriptor: Stale file handle messages on multiple Gitaly-Ruby RPCs. In these cases the request fails, so no data loss occurs.
Current NFS mount settings on Gitaly nodes:
nfs4 (rw,relatime,vers=4.1,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.182.169.125,local_lock=none)
Note that the customer will be adding lookupcache=positive to their settings at their earliest opportunity.
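To confirm on each Gitaly node whether the lookupcache change has taken effect, the mount option string can be parsed and inspected. This is a rough sketch with hypothetical helper names (`parse_mount_options`, `lookupcache_setting`). It assumes the input has the same `fstype (opt,opt=value,...)` shape as the line quoted above, and that an absent lookupcache option means the client default of `all`, as described in nfs(5).

```python
def parse_mount_options(line):
    """Split 'fstype (opt1,opt2=val,...)' into (fstype, options dict).

    Flag options without a value (for example 'hard') map to True.
    """
    fstype, _, rest = line.partition("(")
    options = {}
    for item in rest.strip().rstrip(")").split(","):
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return fstype.strip(), options


def lookupcache_setting(options):
    """Return the lookupcache mode, assuming 'all' when it is not listed."""
    return options.get("lookupcache", "all")
```

Running this against the current settings should report `all`; once the customer has remounted with the new option, it should report `positive`.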