Millions of open file descriptors for /usr/bin causing downtime
We went live with Stackgres in production just over a week ago. Since then we've had two outages due to 'Too many open files' errors.
It seems that in our deployment, both stackgres-cluster-controller and stackgres-operator open a file handle for the directory /usr/bin and never close it. The count grows until, a few days later, there are nearly 2 million open file handles for /usr/bin and the other containers in the pod crash and restart continually.
The numbers below show how many file handles have accumulated since this morning, when the outage occurred and I restarted the pods.
We install via the Helm chart; below is the CDKTF for Python code we use to deploy it. As you can see, it's a pretty basic setup.
# Install the helm chart
release = HelmRelease(
    self,
    name='stackgres-operator',
    namespace=namespace.metadata.name_input,
    provider=self.helm_provider,
    repository='https://stackgres.io/downloads/stackgres-k8s/stackgres/helm/',
    chart='stackgres-operator',
    version='1.16.2',
    depends_on=[namespace],
    values=[
        yaml.safe_dump({
            'adminui': {
                'service': {
                    'exposeHTTP': True,
                },
            },
            'imagePullSecrets': [{'name': 'default-registry-secret'}],
            'operator': {'affinity': affinity},
            'restapi': {'affinity': affinity},
        })
    ],
)
Below is the output from one of our nodes running the stackgres-operator. You can see that within 12 hours it has accumulated around 900,000 open file handles for /usr/bin:
root@d11:~# lsof 2>/dev/null | awk '{print substr($0, 1, 17)}' | sort | uniq -c | sort -nr | head -n 10
874540 stackgres 2871525
120243 java 2888590
101530 java 2888556
21771 wdavdaemo 2792976
10983 container 1731
5810 wdavdaemo 204772
4956 wdavdaemo 3805083
4200 kubelite 1321057
3216 k8s-dqlit 1734
2715 uwsgi 2890605
root@d11:~# ps aux | grep 2871525
ryan 2871525 11.1 1.2 4375452 391568 ? Ssl 03:13 112:36 /app/stackgres-operator -Dquarkus.http.host=0.0.0.0 -Dquarkus.http.port=8080 -Dquarkus.http.ssl-port=8443 -Djava.util.lo
root@d11:~# lsof 2>/dev/null | grep '^stackgres' | grep '/usr/bin$' | wc -l
922452
root@d11:~# lsof 2>/dev/null | grep '^stackgres' | grep '/usr/bin$' | head -n 3
stackgres 2871525 ryan 32r DIR 0,194 4096 10262534 /usr/bin
stackgres 2871525 ryan 35r DIR 0,194 4096 10262534 /usr/bin
stackgres 2871525 ryan 38r DIR 0,194 4096 10262534 /usr/bin
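For anyone wanting to track the leak without a full lsof scan (which gets slow at these counts), a small /proc-based counter works too. This is a minimal sketch, assuming a Linux node; the function name is ours, and the PID would be whatever ps reports for the stackgres process:

```python
import os

def count_dir_fds(pid: int, target: str = "/usr/bin") -> int:
    """Count open descriptors of `pid` that resolve to `target`.

    Reads /proc/<pid>/fd directly instead of scanning with lsof.
    """
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            if os.readlink(os.path.join(fd_dir, fd)) == target:
                count += 1
        except OSError:
            continue  # fd was closed between listdir and readlink
    return count

if __name__ == "__main__":
    # Example: inspect our own process; substitute the stackgres PID.
    print(count_dir_fds(os.getpid()))
```

Running this in a loop every few minutes gives a rough leak rate per process.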
Below is the output from one of our nodes that runs a cluster and therefore stackgres-cluster-controller. You can see that after 12 hours it is sitting at around 250,000 open handles; the outage occurred when this count reached 2 million.
root@d2:~# lsof 2>/dev/null | awk '{print substr($0, 1, 17)}' | sort | uniq -c | sort -nr | head -n 10
238380 stackgres 1250167
27573 envoy 1250000
23712 cluster-a 1406
17591 wdavdaemo 2964659
5976 wdavdaemo 2964732
5486 container 1409
4956 wdavdaemo 1684511
1634 container 1249774
1160 wazuh-log 1580802
1118 calico-no 1098993
root@d2:~# ps aux | grep 1250167
mdatp 1250167 0.2 0.5 2839952 379248 ? Ssl 03:16 2:26 /app/stackgres-cluster-controller -Xmx535822337 -Dquarkus.http.host=0.0.0.0 -Djava.util.logging.manager=org.jboss.logmanager.LogManager
root@d2:~# lsof 2>/dev/null | grep '^stackgres' | grep '/usr/bin$' | wc -l
253394
root@d2:~# lsof 2>/dev/null | grep '^stackgres' | grep '/usr/bin$' | head -n 3
stackgres 1250167 mdatp 42r DIR 0,247 4096 16811757 /usr/bin
stackgres 1250167 mdatp 43r DIR 0,247 4096 16811757 /usr/bin
stackgres 1250167 mdatp 44r DIR 0,247 4096 16811757 /usr/bin
We'd appreciate any insights or assistance. Stackgres is a fantastic product and we really want to make good use of it. Thanks.
Please note that the mdatp and ryan users mentioned above correspond to UIDs 999 and 1000 respectively inside the pods; the host usernames simply share the same UIDs.