Millions of open file descriptors for /usr/bin causing downtime

We went live with Stackgres in production just over a week ago. Since then we've had two outages caused by 'Too many open files' errors.

It seems that in our deployment, both stackgres-cluster-controller and stackgres-operator open a file handle for the directory /usr/bin and never close it. The count grows until, a few days later, there are nearly 2 million open file handles for /usr/bin and the other containers in the pod crash and restart continually.

The numbers below show how many file handles have been opened since this morning, when we hit the downtime and I restarted the pods.

We install via the Helm chart; below is the CDKTF for Python code we use to deploy it. As you can see, it's a pretty basic setup.

        # Install the helm chart
        release = HelmRelease(
            self,
            name='stackgres-operator',
            namespace=namespace.metadata.name_input,
            provider=self.helm_provider,
            repository='https://stackgres.io/downloads/stackgres-k8s/stackgres/helm/',
            chart='stackgres-operator',
            version='1.16.2',
            depends_on=[namespace],
            values=[
                yaml.safe_dump({
                    'adminui': {
                        'service': {
                            'exposeHTTP': True,
                        },
                    },
                    'imagePullSecrets': [{'name': 'default-registry-secret'}],
                    'operator': {'affinity': affinity},
                    'restapi': {'affinity': affinity},
                })
            ],
        )

Below is the output from one of our nodes running the stackgres-operator. You can see that within 12 hours it has accumulated around 900,000 open file handles for /usr/bin:

root@d11:~# lsof 2>/dev/null | awk '{print substr($0, 1, 17)}' | sort | uniq -c | sort -nr | head -n 10
 874540 stackgres 2871525
 120243 java      2888590
 101530 java      2888556
  21771 wdavdaemo 2792976
  10983 container    1731
   5810 wdavdaemo  204772
   4956 wdavdaemo 3805083
   4200 kubelite  1321057
   3216 k8s-dqlit    1734
   2715 uwsgi     2890605

root@d11:~# ps aux | grep 2871525
ryan     2871525 11.1  1.2 4375452 391568 ?      Ssl  03:13 112:36 /app/stackgres-operator -Dquarkus.http.host=0.0.0.0 -Dquarkus.http.port=8080 -Dquarkus.http.ssl-port=8443 -Djava.util.lo

root@d11:~# lsof 2>/dev/null | grep '^stackgres' | grep '/usr/bin$' | wc -l
922452

root@d11:~# lsof 2>/dev/null | grep '^stackgres' | grep '/usr/bin$' | head -n 3
stackgres 2871525                               ryan   32r      DIR              0,194      4096   10262534 /usr/bin
stackgres 2871525                               ryan   35r      DIR              0,194      4096   10262534 /usr/bin
stackgres 2871525                               ryan   38r      DIR              0,194      4096   10262534 /usr/bin
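In case it's useful for cross-checking, we've also been counting these handles straight from /proc rather than running a full lsof sweep. A minimal sketch (assumes Linux and root access on the node; the PID and path are just examples, pass whatever ps shows for the leaking process):

```python
#!/usr/bin/env python3
"""Count how many open fds of a process point at a given path (Linux /proc)."""
import os
import sys

def count_fds_for_path(pid: int, path: str = "/usr/bin") -> int:
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        if target == path:
            count += 1
    return count

if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. ./count_fds.py 2871525
    print(count_fds_for_path(int(sys.argv[1])))
```

The numbers it reports line up with the lsof counts above.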

Below is the output from one of our nodes that runs a cluster, and therefore stackgres-cluster-controller. You can see that after 12 hours it is sitting at around 250,000 handles. This is what caused our downtime when it reached 2 million.

root@d2:~# lsof 2>/dev/null | awk '{print substr($0, 1, 17)}' | sort | uniq -c | sort -nr | head -n 10
 238380 stackgres 1250167
  27573 envoy     1250000
  23712 cluster-a    1406
  17591 wdavdaemo 2964659
   5976 wdavdaemo 2964732
   5486 container    1409
   4956 wdavdaemo 1684511
   1634 container 1249774
   1160 wazuh-log 1580802
   1118 calico-no 1098993

root@d2:~# ps aux | grep 1250167
mdatp    1250167  0.2  0.5 2839952 379248 ?      Ssl  03:16   2:26 /app/stackgres-cluster-controller -Xmx535822337 -Dquarkus.http.host=0.0.0.0 -Djava.util.logging.manager=org.jboss.logmanager.LogManager

root@d2:~# lsof 2>/dev/null | grep '^stackgres' | grep '/usr/bin$' | wc -l
253394

root@d2:~# lsof 2>/dev/null | grep '^stackgres' | grep '/usr/bin$' | head -n 3
stackgres 1250167                              mdatp   42r      DIR              0,247       4096   16811757 /usr/bin
stackgres 1250167                              mdatp   43r      DIR              0,247       4096   16811757 /usr/bin
stackgres 1250167                              mdatp   44r      DIR              0,247       4096   16811757 /usr/bin
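As a stopgap until we understand the leak, we're extrapolating when a process will hit the ~2 million mark so we can restart the pods pre-emptively. A rough sketch (simple linear extrapolation from two samples; the ceiling and sample interval are assumptions based on where our pods fell over):

```python
#!/usr/bin/env python3
"""Estimate hours until a leaking process hits a given fd count (Linux /proc)."""
import os
import time

CEILING = 2_000_000  # roughly where our pods started crash-looping

def fds_matching(pid: int, path: str) -> int:
    """Count open fds of `pid` whose symlink target equals `path`."""
    fd_dir = f"/proc/{pid}/fd"
    n = 0
    for fd in os.listdir(fd_dir):
        try:
            if os.readlink(os.path.join(fd_dir, fd)) == path:
                n += 1
        except OSError:
            pass  # fd closed while we were scanning
    return n

def hours_until_ceiling(pid: int, path: str, interval_s: float = 60.0) -> float:
    """Sample twice, `interval_s` apart, and linearly extrapolate to CEILING."""
    first = fds_matching(pid, path)
    time.sleep(interval_s)
    second = fds_matching(pid, path)
    rate_per_s = (second - first) / interval_s
    if rate_per_s <= 0:
        return float("inf")  # not growing right now
    return (CEILING - second) / rate_per_s / 3600.0
```

Obviously a restart only buys time; we'd much rather see the handles closed in the first place.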

Would appreciate any insights or assistance. Stackgres is a fantastic product and we really want to make good use of it. Thanks.

Please note that the users mdatp and ryan shown above correspond to UIDs 999 and 1000 respectively inside the pods; the host simply resolves those UIDs to its own local usernames.

Edited by Ryan Butterfield