The current method for checking which PostgreSQL server is primary is too dangerous for use
Summary
The current method of checking for the primary in a PostgreSQL cluster is dangerous during any event in which a server in the cluster might already be down. The command gitlab-ctl repmgr cluster show tries to reach every server participating in the repmgr configuration. When a server is down, this command takes a long time to return, so the function that performs the check does not progress properly.
Because of this, Consul does not configure databases.ini. If the problem is active while PostgreSQL fails over to a new primary, Consul will be unable to update PgBouncer, and all connections through PgBouncer will be incorrect, leading to an outage.
To make matters worse, this function appears to execute every 10 seconds (at least on GitLab.com), while a single check takes roughly 2 minutes, so we end up with a backlog of requests all performing the same check.
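To make the backlog concrete, here is a minimal Python sketch, not the code in repmgr.rb. The helper names run_bounded and backlog are hypothetical, and the timeout value is an assumption chosen only for illustration. It shows how a hard timeout would keep a slow check from running past its interval, and how many checks overlap when no such limit exists.

```python
import subprocess
import sys


def run_bounded(cmd, timeout_s):
    """Run cmd and return its stdout, or None if it fails or exceeds timeout_s seconds.

    Hypothetical helper: subprocess.run kills the child process when the
    timeout expires, so a hung check cannot outlive its time budget.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return None
    return result.stdout


def backlog(check_duration_s, interval_s):
    """Approximate number of checks in flight when each one outlasts the interval."""
    return -(-check_duration_s // interval_s)  # ceiling division on integers


if __name__ == "__main__":
    # With the figures from this issue: a ~120 s check started every 10 s.
    print(backlog(120, 10))
    # Assumed 5 s budget; the command itself is the one named in the summary.
    print(run_bounded(["gitlab-ctl", "repmgr", "cluster", "show"], timeout_s=5))
```

With a check of roughly 120 seconds started every 10 seconds, backlog(120, 10) gives 12 checks running at once. A timeout shorter than the interval would cap that at one, although it would not by itself tell Consul which node is primary.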
Proposal
Determine a better method for identifying the primary PostgreSQL node, one that does not depend on every server in the cluster being reachable.
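One possible direction, offered only as a sketch and not as the agreed fix: have each node answer for itself instead of polling the whole cluster. PostgreSQL's pg_is_in_recovery() returns false on a primary and true on a standby, so a local query does not wait on unreachable peers. The function names below are hypothetical, plain psql with connection settings taken from the environment is an assumption (an Omnibus install may need a different client wrapper), and a node reporting "not in recovery" does not by itself rule out a split-brain situation.

```python
import subprocess


def parse_is_primary(psql_output):
    """Interpret the output of SELECT pg_is_in_recovery() from psql -A -t.

    Returns True for a primary ("f"), False for a standby ("t"),
    and None when the output is anything else.
    """
    value = psql_output.strip()
    if value == "f":
        return True
    if value == "t":
        return False
    return None


def local_node_is_primary(timeout_s=2.0, run=subprocess.run):
    """Ask only the local PostgreSQL instance whether it is the primary.

    Assumption: psql is on PATH and can connect using environment defaults.
    The run parameter exists so the command runner can be replaced in tests.
    """
    cmd = ["psql", "-X", "-A", "-t", "-c", "SELECT pg_is_in_recovery()"]
    try:
        result = run(cmd, capture_output=True, text=True, timeout=timeout_s, check=True)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return None  # unknown: treat as "do not change the configuration"
    return parse_is_primary(result.stdout)
```

Returning None for the unknown case keeps a timeout or connection failure distinct from "this node is a standby", so a caller can leave the existing configuration alone rather than act on a guess.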
References
https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5357
https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/files/gitlab-ctl-commands-ee/lib/repmgr.rb#L225-238