SGCluster fails to start when the node is running in FIPS mode
Summary, current behaviour
FIPS mode is an enhanced security mode that ensures only certain cryptographic algorithms and best practices are used on a system. Postgres depends on an SSL library, which is FIPS-aware. There's a relevant post about Postgres and FIPS mode worth reviewing.
FIPS mode is an OS setting, but it is exposed via the kernel's /proc interface and therefore becomes a "node setting". So "running on FIPS mode" essentially means running a container on a worker node that has been configured for FIPS mode.
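Because the flag lives under /proc, it can be read from inside any container scheduled on the node. A minimal sketch, assuming a POSIX shell is available in the container; the helper name fips_enabled and its optional path argument are illustrative only:

```shell
#!/bin/sh
# Report whether the kernel advertises FIPS mode.
# $1 (optional): flag file to read. It defaults to the kernel's /proc entry
# and is a parameter only so the function can be exercised with a test file.
fips_enabled() {
    flag_file="${1:-/proc/sys/crypto/fips_enabled}"
    # A missing or unreadable file is treated as "FIPS not enabled".
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

if fips_enabled; then
    echo "FIPS mode is enabled on this node"
else
    echo "FIPS mode is not enabled on this node"
fi
```

Running this through kubectl exec in a pod on the affected worker node should agree with what fips-mode-setup --check reports on the host.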
When running StackGres on a FIPS-enabled worker node, creation of the SGCluster fails with the following message:
FATAL: password encryption failed: disabled for FIPS
STATEMENT: ALTER USER "postgres" WITH PASSWORD **********
Steps to reproduce
- Set up a FIPS-enabled worker node. OS distros like RHEL and Ubuntu Pro support it. Here it was reproduced with RHEL 8. More information about FIPS on RHEL.
- Log in to or create an account with Red Hat and download the evaluation version of RHEL (60 days; a login is sufficient). Tried with the 8.5 ISO, since 8.5 seems to be a fully FIPS-validated version.
- For simplicity, create a VM (e.g. KVM, VirtualBox) to install it on.
- At installation boot, press [Tab] to edit the command line options and append "fips=1" to the install command line.
- Install.
- Upon boot, check:
[root@localhost ~]# fips-mode-setup --check
FIPS mode is enabled.
[root@localhost ~]# cat /proc/sys/crypto/fips_enabled
1
- Install and configure k3s:
curl -sfL https://get.k3s.io | sh -
…
alias kubectl="sudo /usr/local/bin/k3s kubectl"
- Install Helm, then install StackGres with Helm and create a basic cluster. Creating a simple cluster fails:
simple-0 5/6 CrashLoopBackOff 5 (2s ago) 4m48s
simple-0 5/6 Running 6 (2m49s ago) 7m35s
- Events:
[aht@localhost ~]$ kubectl logs simple-0 -c patroni
6m21s Normal ClusterPgBouncerConfigUpdated SGCluster/simple Patroni config updated
2m51s (x21 over 6m18s) Warning BackOff Pod/simple-0 Back-off restarting failed container patroni in pod simple-0_default(139aed26-9d1a-4acc-b87d-0d2fd74f168f)
21s (x8 over 6m21s) Warning ClusterControllerFailed SGCluster/simple An error occurred while reconciling patroni configuration: Process with pattern ^[^ ]+ /usr/bin/patroni .*$ not found
19s (x39 over 8m6s) Normal ClusterUpdated SGCluster/simple Cluster default.simple updated: StatefulSet:simple (+/metadata/creationTimestamp -> 2023-12-31T18:53:04Z), StatefulSet:simple (+/metadata/generation -> 1) and other 46 resources where patched
- Logs:
[aht@localhost ~]$ kubectl logs simple-0 -c patroni
…
creating configuration files ... ok
running bootstrap script ... ok
2023-12-31 19:00:40.397 UTC [2142] FATAL: password encryption failed: disabled for FIPS
2023-12-31 19:00:40.397 UTC [2142] STATEMENT: ALTER USER "postgres" WITH PASSWORD E'****-****-****-***';
child process exited with exit code 1
initdb: removing data directory "/var/lib/postgresql/data"
2023-12-31 19:00:40,406 INFO: removing initialize key after failed attempt to bootstrap the cluster
Traceback (most recent call last):
  File "/usr/bin/patroni", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3.9/site-packages/patroni/__main__.py", line 345, in main
    return patroni_main(args.configfile)
  File "/usr/lib/python3.9/site-packages/patroni/__main__.py", line 239, in patroni_main
    abstract_main(Patroni, configfile)
  File "/usr/lib/python3.9/site-packages/patroni/daemon.py", line 174, in abstract_main
    controller.run()
  File "/usr/lib/python3.9/site-packages/patroni/__main__.py", line 194, in run
    super(Patroni, self).run()
  File "/usr/lib/python3.9/site-packages/patroni/daemon.py", line 143, in run
    self._run_cycle()
  File "/usr/lib/python3.9/site-packages/patroni/__main__.py", line 203, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/lib/python3.9/site-packages/patroni/ha.py", line 1972, in run_cycle
    info = self._run_cycle()
  File "/usr/lib/python3.9/site-packages/patroni/ha.py", line 1785, in _run_cycle
    return self.post_bootstrap()
  File "/usr/lib/python3.9/site-packages/patroni/ha.py", line 1669, in post_bootstrap
    self.cancel_initialization()
  File "/usr/lib/python3.9/site-packages/patroni/ha.py", line 1662, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: Failed to bootstrap cluster
performing post-bootstrap initialization …
Expected Behaviour
Cluster should be created without issues.
Possible Solution
The password_encryption parameter is not explicitly being set to scram-sha-256, but it is the default value since Postgres 14 (and Postgres 16 has been tested here). Still, an explicit test can be performed by creating the following SGPostgresConfig:
apiVersion: stackgres.io/v1
kind: SGPostgresConfig
metadata:
  name: pgconfig1
spec:
  postgresVersion: "16"
  postgresql.conf:
    password_encryption: 'scram-sha-256'
and referencing it from the SGCluster. However, the results were the same: cluster creation fails at the same step.
The problem seems to be that, despite this default value, at some point there is an attempt to create an MD5 password. Note that MD5 is forbidden in FIPS mode, which is why this same cluster creation (with or without the above Postgres configuration) works without a problem on a cluster that is not FIPS-enabled.
Examining this on a non-FIPS cluster, we note that /etc/patroni/config.yml contains:
bootstrap:
  post_init: '/usr/local/bin/post-init.sh'
  initdb:
    - auth-host: md5
    - auth-local: trust
    - encoding: UTF8
    - locale: C.UTF-8
    - data-checksums
  pg_hba:
    - 'host all all 0.0.0.0/0 md5'
    - 'host replication replicator 0.0.0.0/0 md5'
So it appears that MD5 is being set at cluster bootstrap (initdb) as well as for pg_hba.conf. The latter should not be a problem, as according to Postgres documentation on password authentication "if md5 is specified as a method in pg_hba.conf but the user's password on the server is encrypted for SCRAM (see below), then SCRAM-based authentication will automatically be chosen instead".
But the former probably is: at initdb time the Postgres parameters are not configured yet, so if MD5 is explicitly requested it is probably honored as such.
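To make this hypothesis concrete: Patroni is understood to pass each entry of the bootstrap initdb list to initdb as a long command-line option. That is an assumption here and has not been verified against the StackGres image. The sketch below renders the entries from the config above as the flags initdb would receive under that assumption; initdb_args is a hypothetical helper name.

```shell
#!/bin/sh
# Convert "key: value" entries and bare flags (one per line on stdin)
# into initdb-style long options, one per line on stdout.
initdb_args() {
    while IFS= read -r entry; do
        [ -n "$entry" ] || continue
        case "$entry" in
            *": "*)
                # "auth-host: md5" becomes "--auth-host=md5"
                printf '%s\n' "--${entry%%: *}=${entry#*: }" ;;
            *)
                # "data-checksums" becomes "--data-checksums"
                printf '%s\n' "--${entry}" ;;
        esac
    done
}

printf '%s\n' \
    'auth-host: md5' \
    'auth-local: trust' \
    'encoding: UTF8' \
    'locale: C.UTF-8' \
    'data-checksums' | initdb_args
```

If the mapping holds, --auth-host=md5 reaches initdb before any setting from the SGPostgresConfig is applied.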
It is surprising that creating the cluster on a non-FIPS environment shows that SCRAM-SHA-256 is used despite this setting!
postgres=# table pg_authid ;
  oid  |           rolname           | rolsuper | rolinherit | rolcreaterole | rolcreatedb | rolcanlogin | rolreplication | rolbypassrls | rolconnlimit |                       rolpassword                       | rolvaliduntil
-------+-----------------------------+----------+------------+---------------+-------------+-------------+----------------+--------------+--------------+---------------------------------------------------------+---------------
    10 | postgres                    | t        | t          | t             | t           | t           | t              | t            |           -1 | SCRAM-SHA-256$4096:************************************ |
 16384 | replicator                  | f        | t          | f             | f           | t           | t              | f            |           -1 | SCRAM-SHA-256$4096:************************************ |
 16385 | authenticator               | t        | t          | f             | f           | t           | f              | f            |           -1 | SCRAM-SHA-256$4096:************************************ |
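The prefix of rolpassword shows which format each stored password uses. A rough sketch for checking this in bulk follows. The helper classify_rolpassword is hypothetical; it relies only on the stored formats Postgres documents: SCRAM verifiers start with SCRAM-SHA-256$, and MD5 entries are the string md5 followed by 32 hex digits.

```shell
#!/bin/sh
# Classify a pg_authid.rolpassword value by its stored format.
classify_rolpassword() {
    value="${1:-}"
    case "$value" in
        "")
            echo "none" ;;
        'SCRAM-SHA-256$'*)
            echo "scram-sha-256" ;;
        md5*)
            # MD5 entries are "md5" plus 32 hex digits, 35 characters in total.
            # Only the length is checked here, not the hex alphabet.
            if [ "${#value}" -eq 35 ]; then echo "md5"; else echo "unknown"; fi ;;
        *)
            echo "unknown" ;;
    esac
}

# Example with a made-up verifier (not a real credential):
classify_rolpassword 'SCRAM-SHA-256$4096:salt$storedkey:serverkey'
```

On the non-FIPS cluster above, every role would be classified as scram-sha-256, which matches the table.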
I'm not an expert in either Patroni's source code or Python, but here is (probably) the relevant code being executed. It may be creating the user twice: first with MD5, honoring the configuration request, and later with SCRAM-SHA-256 once the password_encryption value is honored.
Whatever the case is, it looks like these settings are hardcoding MD5 when they shouldn't. If I'm not mistaken, these settings are being generated from start-patroni.sh. To fix it, I propose to:
- Remove this hardcoding of MD5. Instead, we should use scram-sha-256 as a default if and only if the user has not explicitly requested md5 as a value for password_encryption. Alternatively, we may remove this parameter altogether if we can verify that this behaviour is Patroni's default (I'm not sure if this is the case, but it can be tested).
- Also change the relevant lines in .initdb.pg_hba that reference md5 and replace them following the same logic as above.
- Remove/reconsider the use of trust in pg_hba as well and use peer instead. This requires testing to ensure the container is executed with the same postgres user, so that kubectl exec -it ... -c postgres-util -- psql still works.
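The first two proposals could be sketched roughly as follows. This is not taken from start-patroni.sh: the function names select_auth_method and render_bootstrap_auth are hypothetical, the YAML layout simply mirrors the config.yml excerpt above, and auth-local is left at trust because the third proposal still needs testing.

```shell
#!/bin/sh
# Hypothetical helper: pick the auth method for initdb and pg_hba.
# $1: the password_encryption value requested by the user (may be empty).
select_auth_method() {
    if [ "${1:-}" = "md5" ]; then
        echo "md5"              # only when explicitly requested
    else
        echo "scram-sha-256"    # default; does not rely on MD5
    fi
}

# Hypothetical helper: print the auth-related part of the bootstrap section.
render_bootstrap_auth() {
    method="$(select_auth_method "${1:-}")"
    cat <<EOF
bootstrap:
  initdb:
    - auth-host: ${method}
    - auth-local: trust
  pg_hba:
    - 'host all all 0.0.0.0/0 ${method}'
    - 'host replication replicator 0.0.0.0/0 ${method}'
EOF
}

# No explicit user request, so scram-sha-256 is used throughout.
render_bootstrap_auth ""
```

Keeping the decision in one function means initdb and pg_hba cannot drift apart, which is the "same logic" the second bullet asks for.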
Environment
- StackGres version: 1.7.0, but also reported to happen on previous versions.
- Kubernetes version: 1.28 from K3s (see above).
- Cloud provider or hardware configuration: VM with RHEL 8.5 with FIPS enabled (see above).