SGCluster fails to start when the node is running in FIPS mode

Summary, current behaviour

FIPS mode is an enhanced security mode that ensures only certain cryptographic algorithms and best practices are used on a system. Postgres depends on an SSL library that is FIPS-aware. There's a relevant post about Postgres and FIPS mode worth reviewing.

FIPS mode is an OS setting, but it is exposed via the kernel's /proc interface and therefore becomes a "node setting". So "running in FIPS mode" essentially means running a container on a worker node that has been configured for FIPS mode.
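
Because the flag is just a file under /proc, it can be read from any container scheduled on the node. The following is a minimal sketch of such a check; the function name fips_status and its optional path argument are illustrative assumptions and not part of StackGres:

```shell
#!/bin/sh
# Report whether the kernel FIPS flag is set.
# fips_status is a hypothetical helper; the optional argument only exists
# so the function can be pointed at a different file (e.g. for testing).
fips_status() {
  flag_file="${1:-/proc/sys/crypto/fips_enabled}"
  # A missing or unreadable file is treated as "FIPS not enabled".
  if [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]; then
    echo "enabled"
  else
    echo "disabled"
  fi
}
```

Calling fips_status with no arguments inside a container on the affected node should report the same state as the node itself.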

When running StackGres on a FIPS-enabled worker node, SGCluster creation fails with the following message:

FATAL: password encryption failed: disabled for FIPS
STATEMENT: ALTER USER "postgres" WITH PASSWORD  **********

Steps to reproduce

  1. Set up a FIPS-enabled worker node. OS distros like RHEL and Ubuntu Pro support it. Here it was reproduced with RHEL 8. More information about FIPS on RHEL.

  2. Log in to (or create) a Red Hat account and download the evaluation version of RHEL (60 days; a login is sufficient). This was tried with the 8.5 ISO, since 8.5 seems to be a fully FIPS-validated version.

  3. For simplicity, create a VM (e.g. KVM, VirtualBox) to install it on.

  4. At installation boot, press [Tab] to edit the command-line options and append fips=1 to the install command line.

  5. Install.

  6. Upon boot, check:

    [root@localhost ~]# fips-mode-setup --check
    FIPS mode is enabled.
    
    [root@localhost ~]# cat /proc/sys/crypto/fips_enabled
    1

  7. Install and configure k3s:

    curl -sfL https://get.k3s.io | sh -
    
    alias kubectl="sudo /usr/local/bin/k3s kubectl"

  8. Install Helm, then install StackGres with Helm and create a basic cluster (named simple here). It fails:

    simple-0   5/6 	CrashLoopBackOff   5 (2s ago)	4m48s
    simple-0   5/6 	Running        	6 (2m49s ago)   7m35s

  9. Events:

    [aht@localhost ~]$ kubectl get events
    
    6m21s                       Normal  ClusterPgBouncerConfigUpdated   SGCluster/simple                                Patroni config updated
    2m51s (x21 over 6m18s)   Warning   BackOff                          Pod/simple-0                                    Back-off restarting failed container patroni in pod simple-0_default(139aed26-9d1a-4acc-b87d-0d2fd74f168f)
    21s (x8 over 6m21s)         Warning   ClusterControllerFailed       SGCluster/simple                                An error occurred while reconciling patroni configuration: Process with pattern ^[^ ]+ /usr/bin/patroni .*$ not found
    19s (x39 over 8m6s)         Normal  ClusterUpdated                  SGCluster/simple                                Cluster default.simple updated: StatefulSet:simple (+/metadata/creationTimestamp -> 2023-12-31T18:53:04Z), StatefulSet:simple (+/metadata/generation -> 1) and other 46 resources where patched

  10. Logs:

    [aht@localhost ~]$ kubectl logs simple-0 -c patroni
    
    
    
    creating configuration files ... ok
    running bootstrap script ... ok
    2023-12-31 19:00:40.397 UTC [2142] FATAL:  password encryption failed: disabled for FIPS
    2023-12-31 19:00:40.397 UTC [2142] STATEMENT:  ALTER USER "postgres" WITH PASSWORD E'****-****-****-***';
    
    child process exited with exit code 1
    initdb: removing data directory "/var/lib/postgresql/data"
    2023-12-31 19:00:40,406 INFO: removing initialize key after failed attempt to bootstrap the cluster
    Traceback (most recent call last):
      File "/usr/bin/patroni", line 8, in <module>
        sys.exit(main())
      File "/usr/lib/python3.9/site-packages/patroni/__main__.py", line 345, in main
        return patroni_main(args.configfile)
      File "/usr/lib/python3.9/site-packages/patroni/__main__.py", line 239, in patroni_main
        abstract_main(Patroni, configfile)
      File "/usr/lib/python3.9/site-packages/patroni/daemon.py", line 174, in abstract_main
        controller.run()
      File "/usr/lib/python3.9/site-packages/patroni/__main__.py", line 194, in run
        super(Patroni, self).run()
      File "/usr/lib/python3.9/site-packages/patroni/daemon.py", line 143, in run
        self._run_cycle()
      File "/usr/lib/python3.9/site-packages/patroni/__main__.py", line 203, in _run_cycle
        logger.info(self.ha.run_cycle())
      File "/usr/lib/python3.9/site-packages/patroni/ha.py", line 1972, in run_cycle
        info = self._run_cycle()
      File "/usr/lib/python3.9/site-packages/patroni/ha.py", line 1785, in _run_cycle
        return self.post_bootstrap()
      File "/usr/lib/python3.9/site-packages/patroni/ha.py", line 1669, in post_bootstrap
        self.cancel_initialization()
      File "/usr/lib/python3.9/site-packages/patroni/ha.py", line 1662, in cancel_initialization
        raise PatroniFatalException('Failed to bootstrap cluster')
    patroni.exceptions.PatroniFatalException: Failed to bootstrap cluster
    performing post-bootstrap initialization …

Expected Behaviour

Cluster should be created without issues.

Possible Solution

The password_encryption parameter is not being set explicitly to scram-sha-256, but that has been the default value since Postgres 14 (and Postgres 16 was tested here). Still, an explicit test can be performed by creating the following SGPostgresConfig:

apiVersion: stackgres.io/v1
kind: SGPostgresConfig
metadata:
  name: pgconfig1
spec:
  postgresVersion: "16"
  postgresql.conf:
    password_encryption: 'scram-sha-256'

and referencing it from the SGCluster. However, the results were the same: cluster creation fails at the same step.
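
For reference, this is roughly what referencing the configuration from the SGCluster can look like. It is only a sketch: sgcluster_manifest is a hypothetical helper, and the instance count and volume size are placeholder values that are not taken from the environment described here.

```shell
#!/bin/sh
# Print a minimal SGCluster manifest that references an SGPostgresConfig.
# sgcluster_manifest is a hypothetical helper. The instance count and the
# persistent volume size are placeholders chosen for illustration.
sgcluster_manifest() {
  cluster_name="$1"
  pgconfig_name="$2"
  cat <<EOF
apiVersion: stackgres.io/v1
kind: SGCluster
metadata:
  name: ${cluster_name}
spec:
  instances: 1
  postgres:
    version: '16'
  pods:
    persistentVolume:
      size: '5Gi'
  configurations:
    sgPostgresConfig: ${pgconfig_name}
EOF
}
```

The output can then be piped to kubectl, e.g. sgcluster_manifest simple pgconfig1 | kubectl apply -f -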

The problem seems to be that, despite this default value, at some point there is an attempt to create an MD5 password. Note that MD5 is forbidden in FIPS mode. That's why this same cluster creation (with or without the above Postgres configuration) works without a problem on a non-FIPS cluster.

Examining this on a non-FIPS cluster, we note that /etc/patroni/config.yml contains:

bootstrap:
  post_init: '/usr/local/bin/post-init.sh'

  initdb:
  - auth-host: md5
  - auth-local: trust
  - encoding: UTF8
  - locale: C.UTF-8
  - data-checksums
  pg_hba:
  - 'host all all 0.0.0.0/0 md5'
  - 'host replication replicator 0.0.0.0/0 md5'

So it appears that MD5 is being set at cluster bootstrap (initdb) as well as for pg_hba.conf. The latter should not be a problem, as according to Postgres documentation on password authentication "if md5 is specified as a method in pg_hba.conf but the user's password on the server is encrypted for SCRAM (see below), then SCRAM-based authentication will automatically be chosen instead".

But the former probably is. At initdb time the Postgres parameters are not configured yet, so if MD5 is explicitly requested it is probably honored as such.
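
To make the suspicion concrete: as far as I understand Patroni's bootstrap.initdb list, each entry is passed to initdb as a long option, so "auth-host: md5" would end up as --auth-host=md5 on the initdb command line. The sketch below only illustrates that mapping; initdb_flags is a hypothetical helper and not Patroni's actual code.

```shell
#!/bin/sh
# Turn "key: value" or bare "key" entries (one per line on stdin) into
# initdb-style long options, one per output line.
# initdb_flags is a hypothetical helper used only to illustrate the mapping.
initdb_flags() {
  while IFS= read -r entry; do
    # Skip empty lines.
    [ -n "$entry" ] || continue
    case "$entry" in
      *": "*) printf '%s\n' "--${entry%%: *}=${entry#*: }" ;;
      *)      printf '%s\n' "--${entry}" ;;
    esac
  done
}
```

Feeding it the entries from the configuration above (auth-host: md5, auth-local: trust, encoding: UTF8, locale: C.UTF-8, data-checksums) yields the options that initdb would be expected to receive, including the explicit MD5 request.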

It is surprising that creating the cluster on a non-FIPS environment shows that SCRAM-SHA-256 is used despite this setting!

postgres=# table pg_authid ;
  oid  |           rolname           | rolsuper | rolinherit | rolcreaterole | rolcreatedb | rolcanlogin | rolreplication | rolbypassrls | rolconnlimit |           rolpassword                                   | rolvaliduntil                                                                               
-------+-----------------------------+----------+------------+---------------+-------------+-------------+----------------+--------------+--------------+---------------------------------------------------------+---------------
    10 | postgres                    | t        | t          | t             | t           | t           | t              | t            |           -1 | SCRAM-SHA-256$4096:************************************
 16384 | replicator                  | f        | t          | f             | f           | t           | t              | f            |           -1 | SCRAM-SHA-256$4096:************************************
 16385 | authenticator               | t        | t          | f             | f           | t           | f              | f            |           -1 | SCRAM-SHA-256$4096:************************************
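
The stored format can also be checked mechanically instead of by eye. A minimal sketch, assuming the usual Postgres conventions (SCRAM verifiers start with SCRAM-SHA-256$ and MD5 hashes start with md5); password_hash_type is a hypothetical helper and only inspects the prefix:

```shell
#!/bin/sh
# Classify a pg_authid.rolpassword value by its prefix.
# password_hash_type is a hypothetical helper. It checks the prefix only
# and does not validate the rest of the string.
password_hash_type() {
  case "$1" in
    'SCRAM-SHA-256$'*) echo "scram-sha-256" ;;
    md5*)              echo "md5" ;;
    '')                echo "none" ;;
    *)                 echo "unknown" ;;
  esac
}
```

It could be fed with the output of psql -Atc "SELECT rolpassword FROM pg_authid WHERE rolname = 'postgres'", for example.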

I'm not an expert on Patroni's source code or Python, but this is (probably) the relevant code being executed, and it may be creating the user twice (first with MD5, honoring the configuration request, and later with SCRAM-SHA-256, when the password_encryption value is honored).

Whatever the case is, it looks like these settings are hardcoding MD5 when they shouldn't. If I'm not mistaken, these settings are being generated from start-patroni.sh. To fix it, I propose to:

  • Remove this hardcoding of MD5. Instead, we should use scram-sha-256 as the default unless the user has explicitly requested md5 as the value for password_encryption. Alternatively, we may remove this parameter altogether if we can verify that this behaviour is Patroni's default (I'm not sure if this is the case, but it can be tested).
  • Also change the relevant lines in .bootstrap.pg_hba that reference md5, following the same logic as above.
  • Remove/reconsider the use of trust in pg_hba as well, and use peer instead. This requires testing to ensure the container runs as the same postgres user, so that kubectl exec -it ... -c postgres-util -- psql still works.
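
The first two points could be sketched along these lines. This is not the actual content of start-patroni.sh; auth_method_for and render_bootstrap_auth are hypothetical names, and the argument stands for whatever value the user set for password_encryption (empty if unset). The trust setting is left untouched here, since the third point needs separate testing.

```shell
#!/bin/sh
# Pick the authentication method: md5 only when explicitly requested,
# scram-sha-256 in every other case (including an unset value).
# Both functions are hypothetical and only illustrate the proposal.
auth_method_for() {
  case "$1" in
    md5) echo "md5" ;;
    *)   echo "scram-sha-256" ;;
  esac
}

# Render the auth-related part of the Patroni bootstrap section using
# the selected method instead of a hardcoded md5.
render_bootstrap_auth() {
  method="$(auth_method_for "$1")"
  cat <<EOF
  initdb:
  - auth-host: ${method}
  - auth-local: trust
  pg_hba:
  - 'host all all 0.0.0.0/0 ${method}'
  - 'host replication replicator 0.0.0.0/0 ${method}'
EOF
}
```

With this logic a cluster only gets md5 if the user asked for it, in which case failing on a FIPS-enabled node is the expected outcome.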

Environment

  • StackGres version: 1.7.0, but also reported to happen on previous versions.
  • Kubernetes version: 1.28 from K3s (see above).
  • Cloud provider or hardware configuration: VM with RHEL 8.5 with FIPS enabled (see above).