Millions of open file descriptors for /usr/bin causing downtime
We went live with StackGres in production just over a week ago. Since then we've had two outages due to 'Too many open files'. It seems that in our deployment, both `stackgres-cluster-controller` and `stackgres-operator` open a file handle for the directory `/usr/bin` and never close it. The count grows until, a few days later, we have nearly 2 million open file handles for `/usr/bin`, and the other containers within the pod crash and restart continually.

The numbers below show how many file handles have been opened since this morning, when we had the outage and I restarted the pods. We use the Helm chart for the install, and below is the CDKTF for Python code we use to deploy it. As you can see, it's a pretty basic setup.

```python
# Install the helm chart
release = HelmRelease(
    self,
    name='stackgres-operator',
    namespace=namespace.metadata.name_input,
    provider=self.helm_provider,
    repository='https://stackgres.io/downloads/stackgres-k8s/stackgres/helm/',
    chart='stackgres-operator',
    version='1.16.2',
    depends_on=[namespace],
    values=[
        yaml.safe_dump({
            'adminui': {
                'service': {
                    'exposeHTTP': True,
                },
            },
            'imagePullSecrets': [{'name': 'default-registry-secret'}],
            'operator': {'affinity': affinity},
            'restapi': {'affinity': affinity},
        })
    ],
)
```

Below is the output from one of our nodes that is running the StackGres operator. You can see that within 12 hours it has around 900,000 open file handles for `/usr/bin`:

```bash
root@d11:~# lsof 2>/dev/null | awk '{print substr($0, 1, 17)}' | sort | uniq -c | sort -nr | head -n 10
 874540 stackgres 2871525
 120243 java      2888590
 101530 java      2888556
  21771 wdavdaemo 2792976
  10983 container 1731
   5810 wdavdaemo 204772
   4956 wdavdaemo 3805083
   4200 kubelite  1321057
   3216 k8s-dqlit 1734
   2715 uwsgi     2890605
root@d11:~# ps aux | grep 2871525
ryan     2871525 11.1  1.2 4375452 391568 ?  Ssl  03:13 112:36 /app/stackgres-operator -Dquarkus.http.host=0.0.0.0 -Dquarkus.http.port=8080 -Dquarkus.http.ssl-port=8443 -Djava.util.lo
root@d11:~# lsof 2>/dev/null | grep '^stackgres' | grep '/usr/bin$' | wc -l
922452
root@d11:~# lsof 2>/dev/null | grep '^stackgres' | grep '/usr/bin$' | head -n 3
stackgres 2871525 ryan 32r DIR 0,194 4096 10262534 /usr/bin
stackgres 2871525 ryan 35r DIR 0,194 4096 10262534 /usr/bin
stackgres 2871525 ryan 38r DIR 0,194 4096 10262534 /usr/bin
```

Below is the output from one of our nodes that is running a cluster, and thus `stackgres-cluster-controller`. You can see that after 12 hours it's sitting at around 250,000. This is what caused the outage when it reached 2 million.

```bash
root@d2:~# lsof 2>/dev/null | awk '{print substr($0, 1, 17)}' | sort | uniq -c | sort -nr | head -n 10
 238380 stackgres 1250167
  27573 envoy     1250000
  23712 cluster-a 1406
  17591 wdavdaemo 2964659
   5976 wdavdaemo 2964732
   5486 container 1409
   4956 wdavdaemo 1684511
   1634 container 1249774
   1160 wazuh-log 1580802
   1118 calico-no 1098993
root@d2:~# ps aux | grep 1250167
mdatp    1250167  0.2  0.5 2839952 379248 ?  Ssl  03:16   2:26 /app/stackgres-cluster-controller -Xmx535822337 -Dquarkus.http.host=0.0.0.0 -Djava.util.logging.manager=org.jboss.logmanager.LogManager
root@d2:~# lsof 2>/dev/null | grep '^stackgres' | grep '/usr/bin$' | wc -l
253394
root@d2:~# lsof 2>/dev/null | grep '^stackgres' | grep '/usr/bin$' | head -n 3
stackgres 1250167 mdatp 42r DIR 0,247 4096 16811757 /usr/bin
stackgres 1250167 mdatp 43r DIR 0,247 4096 16811757 /usr/bin
stackgres 1250167 mdatp 44r DIR 0,247 4096 16811757 /usr/bin
```

Would appreciate any insights or assistance. StackGres is a fantastic product and we really want to make good use of it. Thanks.

Please note: any references above to `mdatp` and `ryan` refer to users 999 and 1000 respectively within their pods (same UID).
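For anyone who wants to reproduce the measurement without a full `lsof` scan (which is slow on busy nodes), here is a sketch that counts the `/usr/bin` descriptors for a single PID by reading `/proc/<pid>/fd` directly. The function name is my own, and `$$` is just a stand-in; substitute the actual operator or cluster-controller PID:

```shell
# Count open FDs pointing at /usr/bin for a given PID by resolving the
# /proc/<pid>/fd symlinks directly instead of scanning with lsof.
# $1 is the PID to inspect (e.g. the stackgres-operator process).
count_usr_bin_fds() {
  pid="$1"
  count=0
  for fd in /proc/"$pid"/fd/*; do
    # Each entry in /proc/<pid>/fd is a symlink to the open file/directory.
    if [ "$(readlink "$fd" 2>/dev/null)" = "/usr/bin" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

count_usr_bin_fds "$$"   # the current shell should normally report 0
```

On an affected node, pointing this at the operator PID (e.g. 2871525 above) should report a count close to the `lsof ... | wc -l` figure, and re-running it a few minutes apart shows the growth rate.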