NFS-Ganesha services are not responding

Summary

All pods that write to any volume provisioned by nfs-ganesha hang. The claims are mounted on the md0 nodes, which show very high load (~300) but remain operational. Listing files in the mount directory, or performing any other operation in the pod, leaves the application hung waiting on I/O. Traffic is visible on port 2049, but the pod itself logs very little, and I am not sure where else to look or how to debug further.
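The Linux load average counts tasks in uninterruptible sleep (state D) as well as runnable ones. A load of ~300 on nodes that are still usable would therefore be consistent with many tasks blocked on NFS I/O rather than with CPU saturation. A minimal sketch for checking this on an affected node, assuming a procps-style ps is available there (the function name blocked_tasks is only illustrative):

```shell
#!/usr/bin/env bash
# Sketch: list tasks in uninterruptible sleep (state D), which is where
# processes blocked on a hung NFS mount usually sit.

# blocked_tasks: read `ps -eo pid,stat,wchan:32,comm` output on stdin and
# print only the rows whose STAT column starts with D. Rows without a
# numeric PID (the header) are skipped.
blocked_tasks() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^D/'
}
```

Run on the node as `ps -eo pid,stat,wchan:32,comm | blocked_tasks`. Appending `| wc -l` gives a count to compare with the reported load. The WCHAN column may hint at whether the tasks are waiting in NFS or RPC code.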

Related references

NFS-Ganesha upstream issue: https://github.com/nfs-ganesha/nfs-ganesha/issues/1295

NFS-Ganesha-server-and-provisioner upstream issue: https://github.com/kubernetes-sigs/nfs-ganesha-server-and-external-provisioner/issues/155

Details

All NFS services (NFS, rpcbind, mountd, and lockd) become unresponsive. This happens when I/O-intensive workloads run against the NFS-Ganesha server.

Observations after the NFS services stop responding:

The NFS-Ganesha pod remains in the Running state, and its logs report no issue.
NFS mounts fail.
The rpcinfo command shows that the NFS services (nfs, rpcbind, lockd, and mountd) do not respond:

bash-5.2# ./ps
PID   USER     TIME  COMMAND
    1 root      0:38 /nfs-provisioner -provisioner=cluster.local/nfs-ganesha-nfs-server-provisioner -device-based-fsids=false
   14 rpc       0:05 /usr/sbin/rpcbind -w
   16 rpcuser   0:00 /usr/sbin/rpc.statd --port 662
   19 dbus      0:04 dbus-daemon --system --nopidfile
   20 root      1:26 ganesha.nfsd -F -L /export/ganesha.log -p /var/run/ganesha.pid -f /export/vfs.conf
  243 root      0:00 bash
  255 root      0:00 ./ps
bash-5.2# ./ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if39: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 8950 qdisc noqueue qlen 1000
    link/ether ba:ef:ca:4e:86:36 brd ff:ff:ff:ff:ff:ff
    inet 100.72.3.94/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::b8ef:caff:fe4e:8636/64 scope link
       valid_lft forever preferred_lft forever
bash-5.2# rpcinfo -p
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp    662  status
    100024    1   tcp    662  status
    100003    3   udp   2049  nfs
    100003    3   tcp   2049  nfs
    100005    1   udp  20048  mountd
    100005    1   tcp  20048  mountd
    100005    3   udp  20048  mountd
    100005    3   tcp  20048  mountd
    100021    4   udp  32803  nlockmgr
    100021    4   tcp  32803  nlockmgr
    100003    4   udp   2049  nfs
    100003    4   tcp   2049  nfs
    100011    1   udp    875  rquotad
    100011    1   tcp    875  rquotad
    100011    2   udp    875  rquotad
    100011    2   tcp    875  rquotad
bash-5.2# rpcinfo -t localhost nfs
localhost: RPC: Remote system error - Connection timed out
bash-5.2# rpcinfo -u localhost nfs
rpcinfo: RPC: Timed out
program 100003 version 0 is not available
bash-5.2# rpcinfo -u localhost nfs 3
rpcinfo: RPC: Timed out
program 100003 version 3 is not available
bash-5.2# uname -a
Linux nfs-ganesha-nfs-server-provisioner-0 6.4.0-150600.23.42-default #1 SMP PREEMPT_DYNAMIC Fri Mar  7 09:53:00 UTC 2025 (7bf6ecd) x86_64 x86_64 x86_64 GNU/Linux
bash-5.2#
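The manual checks in the session above can be repeated for every registered service in one pass. The following is a sketch only, assuming rpcinfo and a timeout command (coreutils or BusyBox) are present in the container. list_tcp_services and probe_services are hypothetical helper names, not part of NFS-Ganesha or the provisioner:

```shell
#!/usr/bin/env bash
# Sketch: probe each TCP service registered with rpcbind and report
# which ones still answer.

# list_tcp_services: read `rpcinfo -p` output on stdin and print
# "program version service" for every TCP registration. Rows without a
# numeric program number (the header) are skipped.
list_tcp_services() {
    awk '$1 ~ /^[0-9]+$/ && $3 == "tcp" { print $1, $2, $5 }'
}

# probe_services HOST [SECONDS]: run `rpcinfo -t` against each registered
# TCP service, bounded by `timeout` so a hung server cannot stall the loop.
probe_services() {
    local host=$1 limit=${2:-5} prog vers name
    rpcinfo -p "$host" | list_tcp_services |
    while read -r prog vers name; do
        if timeout "$limit" rpcinfo -t "$host" "$prog" "$vers" \
                </dev/null >/dev/null 2>&1; then
            echo "OK      $name v$vers"
        else
            echo "NO-RESP $name v$vers"
        fi
    done
}
```

For example, `probe_services localhost 5` prints one OK or NO-RESP line per registered TCP service. The function calls `rpcinfo -p` first, so it only helps while rpcbind still returns the registration list, as it did in the session above.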

CPU utilization rose above 200-300% before the NFS service became unresponsive. Most of the CPU was consumed by the NFS-Ganesha daemon (ganesha.nfsd in the process list above) and by the applications.

Additional Info:

The NFS server works normally again after all pods running I/O-intensive applications are scaled down.
Node info: RAM = 30 GB
Sylva version: 1.3.10

Edited Jul 08, 2025 by Mohan Sharma