NFS-Ganesha services are not responding
Summary
All pods that try to write to any volume provisioned by nfs-ganesha hang. The claims are mounted on md0 nodes, which also show very high load (~300) but remain operational. Any attempt to list files in the mount directory, or to perform any other operation on it in the pod, leaves the application hung awaiting I/O. I've checked and traffic is visible on port 2049, but there are few logs in the pod itself, and I'm not sure where else to check or how to debug further.
Related references
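One way to confirm the "awaiting I/O" symptom on an affected node is to list processes in uninterruptible sleep (state D), which is where tasks blocked on a hung NFS mount usually sit. The following is only a sketch: it assumes a procps-style ps that accepts -eo with the stat and wchan columns (a busybox ps may not), and the function names filter_d_state and list_d_state are hypothetical.

```shell
#!/bin/sh
# Keep the header row plus every row whose STAT column starts with "D"
# (uninterruptible sleep). Reads "PID STAT WCHAN COMMAND" rows on stdin,
# so the filter can be exercised without a hung mount.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Hypothetical wrapper: run on the node that hosts the hung pods.
# Assumes procps ps; wchan shows the kernel function the task sleeps in.
list_d_state() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}
```

Calling list_d_state on the node and seeing the hung applications with NFS- or RPC-related wait channels would support the view that they are blocked on the NFS server rather than on local disk.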
NFS-Ganesha upstream issue: https://github.com/nfs-ganesha/nfs-ganesha/issues/1295
NFS-Ganesha-server-and-provisioner upstream issue: https://github.com/kubernetes-sigs/nfs-ganesha-server-and-external-provisioner/issues/155
Details
All NFS services (i.e. NFS, rpcbind, mountd and lockd) become unresponsive. This happens when intensive I/O runs against NFS-Ganesha.
Below are the observations after the NFS services become unresponsive:
- The NFS-Ganesha pod remains in the Running state. The pod logs report no issue.
- NFS mounts fail.
- The rpcinfo command shows that the NFS services (nfs, rpcbind, lockd and mountd) do not respond:
bash-5.2# ./ps
PID USER TIME COMMAND
1 root 0:38 /nfs-provisioner -provisioner=cluster.local/nfs-ganesha-nfs-server-provisioner -device-based-fsids=false
14 rpc 0:05 /usr/sbin/rpcbind -w
16 rpcuser 0:00 /usr/sbin/rpc.statd --port 662
19 dbus 0:04 dbus-daemon --system --nopidfile
20 root 1:26 ganesha.nfsd -F -L /export/ganesha.log -p /var/run/ganesha.pid -f /export/vfs.conf
243 root 0:00 bash
255 root 0:00 ./ps
bash-5.2# ./ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0@if39: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 8950 qdisc noqueue qlen 1000
link/ether ba:ef:ca:4e:86:36 brd ff:ff:ff:ff:ff:ff
inet 100.72.3.94/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::b8ef:caff:fe4e:8636/64 scope link
valid_lft forever preferred_lft forever
bash-5.2# rpcinfo -p
program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100024 1 udp 662 status
100024 1 tcp 662 status
100003 3 udp 2049 nfs
100003 3 tcp 2049 nfs
100005 1 udp 20048 mountd
100005 1 tcp 20048 mountd
100005 3 udp 20048 mountd
100005 3 tcp 20048 mountd
100021 4 udp 32803 nlockmgr
100021 4 tcp 32803 nlockmgr
100003 4 udp 2049 nfs
100003 4 tcp 2049 nfs
100011 1 udp 875 rquotad
100011 1 tcp 875 rquotad
100011 2 udp 875 rquotad
100011 2 tcp 875 rquotad
bash-5.2# rpcinfo -t localhost nfs
localhost: RPC: Remote system error - Connection timed out
bash-5.2# rpcinfo -u localhost nfs
rpcinfo: RPC: Timed out
program 100003 version 0 is not available
bash-5.2# rpcinfo -u localhost nfs 3
rpcinfo: RPC: Timed out
program 100003 version 3 is not available
bash-5.2# uname -a
Linux nfs-ganesha-nfs-server-provisioner-0 6.4.0-150600.23.42-default #1 SMP PREEMPT_DYNAMIC Fri Mar 7 09:53:00 UTC 2025 (7bf6ecd) x86_64 x86_64 x86_64 GNU/Linux
bash-5.2#
- CPU utilization went beyond 200-300% before the NFS service became unresponsive. Most of the CPU was consumed by the processes called nfs-ganeshad and apps.
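The manual rpcinfo checks in the session above could be repeated for every registered program with a small loop. This is a sketch rather than a verified procedure: it assumes rpcinfo and the coreutils timeout command are available in the container, it relies on rpcinfo returning a non-zero exit status when a probe fails, and the names probe_rpc, RPCINFO and PROBE_TIMEOUT are made up for illustration.

```shell
#!/bin/sh
# Probe each RPC program over TCP (-t) and UDP (-u) and report which answer.
# RPCINFO and PROBE_TIMEOUT are illustrative knobs, not options of any tool.
RPCINFO=${RPCINFO:-rpcinfo}
PROBE_TIMEOUT=${PROBE_TIMEOUT:-10}

probe_rpc() {
    host=$1
    shift
    for prog in "$@"; do
        for flag in -t -u; do
            # timeout bounds each probe so one dead service
            # cannot stall the whole loop.
            if timeout "$PROBE_TIMEOUT" "$RPCINFO" "$flag" "$host" "$prog" >/dev/null 2>&1; then
                echo "$prog $flag ok"
            else
                echo "$prog $flag FAILED"
            fi
        done
    done
}

# Example (inside the pod): probe_rpc localhost nfs mountd nlockmgr status
```

The program names can be replaced by the program numbers from the rpcinfo -p listing (100003, 100005, 100021, 100024) if name lookup is not available in the container.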
Additional Info:
- The NFS server works well after scaling down all the pods running I/O-intensive applications.
- Node info: RAM = 30GB,
- Sylva version: 1.3.10