Support for max_slot_wal_keep_size in gitlab.rb for both postgresql[] and patroni['postgresql'][]

Summary

Prior to PostgreSQL 13, there was no way to limit the disk space used by the write-ahead log (WAL) files retained for a replication slot, so a slot could fill up the disk on the primary.

If one or more replicas are down, this can cause a cascading failure. This happened to a customer's Patroni cluster; GitLab team members with Zendesk access can read more in the ticket. The sequence of events was:

The Patroni leader demoted itself.
One of the replicas took over as leader.
The old leader was unable to complete the pg_rewind process because the new leader didn't have the required WAL files.
The replication slot caused the new leader to retain WAL files for the replica (old leader).
After a few hours, the new leader panicked because the disk had filled and it was unable to write any more WAL files.
The leader was then stuck in a loop of initialising (which happens quickly) and panicking (which takes longer, because attempts to write to the WAL didn't happen immediately), so the remaining replica didn't take over. Even if it had taken over, the final replica might well have filled its disk too.

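PostgreSQL 13 adds the max_slot_wal_keep_size setting, which caps the amount of WAL that replication slots are allowed to retain. The default, -1, means no limit.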

Proposal

Allow max_slot_wal_keep_size to be set in gitlab.rb so that disk use can be limited while a replica is down, and service maintained.

I think it'll be needed for both postgresql and patroni['postgresql'], but the priority in my mind is Patroni. Generally, though, patroni['postgresql'] seems to offer a subset of the settings available within postgresql.
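
As a rough sketch of what this request might look like in gitlab.rb: the two keys below are the ones proposed in this issue and are an assumption, not settings that exist today, and the 10GB value is only an illustrative figure that would need sizing against the actual disk.

```ruby
# Hypothetical gitlab.rb settings proposed by this issue (not yet supported).

# Standalone or non-Patroni PostgreSQL: cap the WAL retained for replication slots.
postgresql['max_slot_wal_keep_size'] = '10GB'

# Patroni-managed PostgreSQL: the same PostgreSQL parameter, passed through Patroni.
patroni['postgresql']['max_slot_wal_keep_size'] = '10GB'
```

One trade-off to keep in mind: once a slot falls further behind than this limit, PostgreSQL can remove WAL that the slot still needs, so the affected replica would most likely have to be reinitialised rather than catching up on its own.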

References

Edited Jun 27, 2023 by Ben Prescott_
