Chef is unable to produce Postgres v12 nodes

Summary

The source of most of these issues is the placement of the Postgres configuration file in the data directory.

  • This file was originally located in the home directory of the gitlab-psql user
    • In cookbooks, this location was referred to as the postgres_config_dir
    • However, we overloaded that variable so that it also determined the user's home directory
  • A different location was required because the upgrade from v11 to v12 required both versions' configuration files to be present during the upgrade
    • The file was placed inside v12's data directory, which is a relatively common practice
    • An update to the recipe was made so that postgres_config_dir pointed to data12
  • This had side effects
    • The gitlab-psql home directory moved to data12, so various other files from that home directory ended up there
    • Postgres correctly expects the data directory to be empty when it's initialized, so a data12 that already contains those files cannot be initialized cleanly
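
The interaction described above can be illustrated with a small sketch. It is not the cookbook's code: the `can_initialize` and `simulate` helpers, the `home` directory name, and the `.bash_history` file are hypothetical stand-ins, and the check only approximates the "data directory must be empty" rule that Postgres applies at initialization. It shows that once the home directory and the data directory are the same path, any file written to the home directory makes the data directory non-empty.

```python
import tempfile
from pathlib import Path


def can_initialize(data_dir: Path) -> bool:
    """Return True if data_dir is missing or empty (approximates the init-time rule)."""
    if not data_dir.exists():
        return True
    return not any(data_dir.iterdir())


def simulate(home_is_data_dir: bool) -> bool:
    """Create a scratch layout and report whether data12 could still be initialized.

    home_is_data_dir=True mimics pointing postgres_config_dir (and therefore
    the user's home directory) at data12. All paths here are hypothetical.
    """
    with tempfile.TemporaryDirectory() as root:
        data12 = Path(root) / "data12"
        data12.mkdir()
        home = data12 if home_is_data_dir else Path(root) / "home"
        home.mkdir(exist_ok=True)
        # Any file that normally lands in a user's home directory.
        (home / ".bash_history").touch()
        return can_initialize(data12)
```

With a separate home directory, `simulate(False)` reports that data12 is still empty; with the overloaded variable, `simulate(True)` reports that it is not.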

Status

Problem

When we provisioned a new Postgres 12 replica to serve as the GCS snapshot and cascade replication node (MR, pipeline), we found that Chef produced a non-functional replica.

@Finotto fixed various problems so that the replica was able to come up, but we need to figure out why Chef is producing unusable nodes. Chef is currently disabled on all patroni-* nodes.

Our plan is to replicate the v12 Chef roles in the benchmarking environment and start debugging there.

FAQ

  • How is this possible if we have 10 nodes in production running right now?

Nine of these were originally provisioned as v11 and then upgraded to v12 through Ansible. Chef was then made to match, which turned out to be significantly trickier than we thought. The last replica was provisioned directly as v12, which is something we had never done before.

Edited by Gerardo Lopez-Fernandez