Chef is unable to produce Postgres v12 nodes

Summary

The source of most of these issues is the placement of the Postgres configuration file in the data directory.

This file was located in the home directory of the gitlab-psql user
- In cookbooks, this location was referred to as the postgres_config_dir
- However, we had also overloaded that variable to mean the gitlab-psql user's home directory
The upgrade from v11 to v12 needed both versions' configuration files to be present at the same time, so the v12 file required a different location
- The file was placed inside v12's data directory, which is a relatively common practice
- An update to the recipe was made so that postgres_config_dir pointed to data12
This had side-effects
- The location of the gitlab-psql home directory was moved to data12, which moved various other files there
- Postgres correctly expects the data directory to be empty when it's initialized, which was no longer the case once those files had been placed there
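
The side-effect above can be sketched in a few lines. This is only an illustration, not the cookbook code: `can_initialize` and `provision_home_files` are hypothetical helpers, and the `.profile` file stands in for whatever ends up in the gitlab-psql home directory.

```python
import os
import tempfile


def can_initialize(data_dir: str) -> bool:
    """Return True if data_dir is missing or empty, i.e. safe to initialize."""
    return not os.path.isdir(data_dir) or not os.listdir(data_dir)


def provision_home_files(home_dir: str) -> None:
    """Hypothetical stand-in for anything that writes into the user's home directory."""
    os.makedirs(home_dir, exist_ok=True)
    with open(os.path.join(home_dir, ".profile"), "w") as fh:
        fh.write("# placeholder\n")


with tempfile.TemporaryDirectory() as root:
    data12 = os.path.join(root, "data12")
    separate_home = os.path.join(root, "home")

    # Decoupled: the home directory lives elsewhere, so data12 stays empty.
    provision_home_files(separate_home)
    decoupled_ok = can_initialize(data12)

    # Overloaded: the home directory IS data12, so it is no longer empty.
    provision_home_files(data12)
    overloaded_ok = can_initialize(data12)

print(decoupled_ok, overloaded_ok)  # True False
```

With one variable serving as both the configuration directory and the home directory, the second case is the only one Chef could produce.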

Status

We have decoupled the locations of postgres_config_dir and pguser_home_dir
However, Chef generates a dummy postgres.conf in the configuration directory, and various cookbooks expect this to be the case
- Updating those cookbooks would significantly widen the scope of changes, so it is unwise at the moment
- Thus, we want to reach a point where a baseline replica is provisioned that requires minimal manual intervention after provisioning
  - We have to delete the chef-provisioned configuration files and let Patroni regenerate them
  - We need to create a runbook to capture this (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13343)
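
Until the runbook exists, the manual cleanup step might look roughly like the sketch below. It is an assumption-laden illustration: `remove_chef_configs` is a hypothetical helper, the list of Chef-generated file names is a guess based on the dummy postgres.conf mentioned above, and the real set of files and paths may differ. It defaults to a dry run so the files can be reviewed before anything is deleted.

```python
import os
import tempfile

# Hypothetical list; the real set of Chef-generated files may differ.
CHEF_GENERATED = ("postgres.conf",)


def remove_chef_configs(config_dir, names=CHEF_GENERATED, dry_run=True):
    """Return the Chef-generated files found in config_dir; delete them unless dry_run."""
    found = []
    for name in names:
        path = os.path.join(config_dir, name)
        if os.path.isfile(path):
            found.append(path)
            if not dry_run:
                os.remove(path)
    return found


# Demonstration against a throwaway directory, not a real node.
with tempfile.TemporaryDirectory() as config_dir:
    open(os.path.join(config_dir, "postgres.conf"), "w").close()
    open(os.path.join(config_dir, "unrelated.txt"), "w").close()

    planned = remove_chef_configs(config_dir)                 # dry run: nothing deleted
    removed = remove_chef_configs(config_dir, dry_run=False)  # actually deletes
    remaining = sorted(os.listdir(config_dir))

print(remaining)  # ['unrelated.txt']
```

Only files named in the list are touched, so anything else in the directory is left for Patroni or the operator to handle.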

Problem

When we provisioned a new Postgres 12 replica to serve as the GCS snapshot and cascade replication node (MR, pipeline), we found that Chef produces a non-functional replica.

@Finotto fixed various problems so that the replica was able to come up, but we need to figure out why Chef is producing unusable nodes. Chef is currently disabled on all patroni-* nodes.

Our plan is to replicate the v12 chef roles in the benchmarking environment to start debugging there.

FAQ

How is this possible if we have 10 nodes in production running right now?

Nine of these were originally provisioned as v11 and then upgraded to v12 through Ansible. Chef was then made to match afterwards (which turned out to be significantly trickier than we thought). The last replica was provisioned directly as v12, which is something we had never done before.

Edited May 11, 2021 by Gerardo Lopez-Fernandez