Support max_slot_wal_keep_size in gitlab.rb for both postgresql[] and patroni['postgresql'][]
Summary
Prior to PostgreSQL 13, there was no way to limit the disk space used by the write-ahead log (WAL) files that a replication slot retains, so a slot can fill up the disk on the primary.
If one or more replicas are down, this can cause a cascading failure. This happened to a customer's Patroni cluster; GitLab team members with Zendesk access can read more in the ticket.
- Patroni leader demoted itself.
- One of the replicas took over as leader.
- The old Patroni leader was unable to complete the pg_rewind process because the new leader didn't have the required WAL files.
- The replication slot caused the new leader to retain WAL files for the replica (old leader).
- After a few hours the new leader panicked because the disk had filled and it was unable to write any more WAL files.
- The leader was then stuck in a loop of initialising (which happens quickly) and panicking (which takes longer, as attempts to write to WAL didn't happen immediately), so the remaining replica didn't take over. Regardless, it's possible that the final replica would also have filled its disk even if it had taken over.
PostgreSQL 13 adds max_slot_wal_keep_size, which limits the amount of WAL that replication slots can retain.
Proposal
Allow max_slot_wal_keep_size to be set in gitlab.rb so that disk use can be limited when a replica is down, and service maintained.
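For illustration, the requested settings might look something like this in gitlab.rb. This is only a sketch of the proposal: neither key exists at the time of writing, the key names are assumed to mirror the PostgreSQL parameter, and the 10GB value is an arbitrary example rather than a recommendation.

```ruby
# /etc/gitlab/gitlab.rb (hypothetical; these keys are what this issue requests)

# Non-Patroni PostgreSQL: cap the WAL that replication slots may retain.
# Assumed key name, chosen to match the PostgreSQL 13 parameter.
postgresql['max_slot_wal_keep_size'] = '10GB'

# Patroni-managed PostgreSQL: same parameter, passed through Patroni.
# Assumed key name; illustrative value only.
patroni['postgresql']['max_slot_wal_keep_size'] = '10GB'
```

With a limit in place, a replica that stays down long enough would lose its slot's WAL and need to be re-initialised, but the leader's disk would not fill.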
I think it'll be needed for both postgresql and patroni['postgresql'], but the priority in my mind is Patroni. Generally, though, patroni['postgresql'] seems to offer a subset of the settings available within postgresql.