Chef is unable to produce Postgres v12 nodes
Summary
The source of most of these issues is the placement of the Postgres configuration file in the data
directory.
- This file was located in the home directory of the `gitlab-psql` user
  - In cookbooks, this location was referred to as the `postgres_config_dir`
  - However, semantically, we overloaded that variable to refer to the user's home directory
- A different location was required because the upgrade from `v11` to `v12` required both versions' configuration files to be present during the upgrade
  - The file was placed inside `v12`'s data directory, which is a relatively common practice
  - An update to the recipe was made so that `postgres_config_dir` pointed to `data12`
- This had side-effects
  - The location of the `gitlab-psql` home directory was moved to `data12`, which moved various other files there
  - Postgres correctly expects the data directory to be empty when it's initialized
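The conflict described above can be illustrated with a minimal Python sketch. It only approximates the "data directory must be empty" expectation; the function name `data_dir_ready_for_init` and the example file name are hypothetical and do not come from the cookbooks or from Postgres itself.

```python
import os
import tempfile


def data_dir_ready_for_init(path):
    """Return True if `path` does not exist yet or is an empty directory.

    Hypothetical helper: it approximates the emptiness expectation,
    it is not the check Postgres performs internally.
    """
    if not os.path.exists(path):
        return True
    return os.path.isdir(path) and not os.listdir(path)


# Demonstration in a throwaway directory standing in for data12.
with tempfile.TemporaryDirectory() as data12:
    print(data_dir_ready_for_init(data12))  # nothing in the directory yet
    # Simulate a file that lands there because the user's home
    # directory was pointed at the data directory.
    open(os.path.join(data12, ".bash_history"), "w").close()
    print(data_dir_ready_for_init(data12))  # directory is no longer empty
```

Any file that follows the home directory into `data12` is enough to make the directory non-empty, which is why sharing one location for both purposes breaks provisioning of a fresh node.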
Status
- We have decoupled the locations of `postgres_config_dir` and `pguser_home_dir`
- However, Chef generates a dummy `postgres.conf` in the configuration directory, and various cookbooks expect this to be the case
  - Updating those cookbooks would significantly amplify the scope of changes, so it is unwise at the moment
- Thus, we want to reach a point where a baseline replica is provisioned that requires minimal manual intervention after provisioning
  - We have to delete the Chef-provisioned configuration files and let Patroni regenerate them
  - We need to create a runbook to capture this (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13343)
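Until the runbook exists, the manual cleanup step could be sketched roughly as follows. This is an illustrative Python sketch, not the team's procedure: the function `remove_provisioned_configs`, its `dry_run` flag, and the file list passed in the demonstration are all assumptions, and the actual set of files to remove would have to come from the runbook.

```python
import tempfile
from pathlib import Path


def remove_provisioned_configs(config_dir, names, dry_run=True):
    """Select the named files that exist in config_dir and optionally delete them.

    Hypothetical helper. Returns the list of selected paths so an
    operator can review them before running with dry_run=False.
    """
    selected = []
    for name in names:
        candidate = Path(config_dir) / name
        if candidate.is_file():
            selected.append(candidate)
            if not dry_run:
                candidate.unlink()
    return selected


# Demonstration against a throwaway directory, not a real node.
with tempfile.TemporaryDirectory() as config_dir:
    (Path(config_dir) / "postgres.conf").write_text("# dummy file\n")
    # A dry run reports what would be removed without touching anything.
    print(remove_provisioned_configs(config_dir, ["postgres.conf"]))
    # The real run deletes the file.
    remove_provisioned_configs(config_dir, ["postgres.conf"], dry_run=False)
    print((Path(config_dir) / "postgres.conf").exists())
```

Defaulting to a dry run is a deliberate choice for a step that deletes files on a database node: the operator sees the selection first and opts in to the deletion explicitly.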
Problem
When we provisioned a new Postgres 12 replica to serve as the GCS snapshot and cascade replication node (MR, pipeline), we found that Chef produced a non-functional replica.
@Finotto fixed various problems so that the replica was able to come up, but we need to figure out why Chef is producing unusable nodes. Chef is currently disabled on all `patroni-*` nodes.
Our plan is to replicate the v12 chef roles in the benchmarking environment to start debugging there.
FAQ
- How is this possible if we have 10 nodes in production running right now?
Nine of these were originally provisioned as v11 and then upgraded to v12 through Ansible. Chef was then made to match, which turned out to be significantly trickier than we thought. The last replica was provisioned directly as v12, which is something we had never done before.