Upgrade Patroni
With the version of Patroni we currently run, a failover can be slow when a large backlog of WAL is still pending upload to the archive.
Patroni 2.1.2 includes a fix for this behavior:
Release the leader lock when pg_controldata reports "shut down" (Alexander)
To solve the problem of slow switchover/shutdown in case archive_command is slow/failing, Patroni will remove the leader key immediately after pg_controldata started reporting PGDATA as shut down cleanly and it verified that there is at least one replica that received all changes. If there are no replicas that fulfill this condition the leader key is not removed and the old behavior is retained, i.e. Patroni will keep updating the lock.
Ref: https://patroni.readthedocs.io/en/latest/releases.html#version-2-1-2
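The release note describes a two-part condition for dropping the leader key early. The sketch below models that decision only to make it concrete. It is not Patroni's code: the `Replica` type, the `should_release_leader_key` function, and the idea of comparing each replica's received LSN against the shutdown checkpoint LSN are hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Replica:
    # Hypothetical model of a replica: its name and the last WAL position
    # (as an integer LSN) it has received from the old primary.
    name: str
    received_lsn: int


def should_release_leader_key(
    controldata_state: str,
    shutdown_checkpoint_lsn: int,
    replicas: Sequence[Replica],
) -> bool:
    """Return True if the leader key may be removed immediately.

    Hypothetical sketch of the 2.1.2 behavior: PGDATA must be reported as
    cleanly shut down, and at least one replica must have received all
    changes (modeled here as received_lsn >= shutdown checkpoint LSN).
    If this returns False, the old behavior applies and the lock keeps
    being updated while archive_command works through the backlog.
    """
    if controldata_state != "shut down":
        return False
    return any(r.received_lsn >= shutdown_checkpoint_lsn for r in replicas)
```

For example, with replicas at LSN 1000 and 1200 and a shutdown checkpoint at 1200, the second replica is fully caught up, so the key can be released without waiting for pending WAL uploads.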
As discussed in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15362#note_879101020, this upgrade should be treated as a priority.
Acceptance criteria:
- Upgrade staging and production to Patroni 2.1.2 or later
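Verifying the acceptance criterion comes down to confirming that every node reports a version of at least 2.1.2. A minimal sketch of that comparison follows. It assumes the version string has already been collected from each node (how to collect it is out of scope here) and looks like `2.1.2` or `patroni 2.1.2`. The helper names `parse_version` and `has_leader_lock_fix` are hypothetical.

```python
from typing import Tuple

# First release containing the leader-lock fix, per the 2.1.2 release notes.
MIN_VERSION: Tuple[int, ...] = (2, 1, 2)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn e.g. "patroni 2.1.2" or "2.1.10" into a tuple of ints.

    Only the last whitespace-separated token is used, and only the leading
    digits of each dotted component are kept (so "2.1.2rc1" is treated as
    (2, 1, 2); pre-release suffixes are deliberately ignored).
    """
    tokens = version.split()
    if not tokens:
        return ()
    parts = []
    for piece in tokens[-1].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_leader_lock_fix(version: str) -> bool:
    """Return True if the version is at or above MIN_VERSION."""
    v = parse_version(version)
    # Pad short versions such as "3.0" so tuple comparison is well defined.
    v = v + (0,) * (len(MIN_VERSION) - len(v))
    return v >= MIN_VERSION
```

Comparing integer tuples rather than raw strings avoids the classic trap where `"2.1.10" < "2.1.2"` lexically.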