Postgres pending WAL files on primary is high
Summary of recent spikes where pg_archiver_pending_wal_count > 2000:
- 2022-03-14 13:28 UTC - 18:28 UTC of the same day
- 2022-03-14 19:48 UTC - 20:35 UTC of the same day
- 2022-03-15 12:07 UTC - 21:32 UTC of the same day
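For anyone who wants to reproduce a list like the one above from exported metric samples, here is a minimal sketch. It assumes the samples are already available as sorted `(timestamp, value)` pairs; the function name `spike_intervals` and the sample data are made up for illustration and are not taken from any existing tooling.

```python
from datetime import datetime, timedelta


def spike_intervals(samples, threshold=2000):
    """Return (start, end) pairs for consecutive runs of samples above threshold.

    samples: iterable of (timestamp, value) pairs, sorted by timestamp.
    threshold: 2000 matches the cutoff used for the summary list above.
    """
    intervals = []
    start = None  # timestamp of the first sample in the current run, if any
    last = None   # timestamp of the most recent sample above the threshold
    for ts, value in samples:
        if value > threshold:
            if start is None:
                start = ts
            last = ts
        elif start is not None:
            # The run ended on the previous above-threshold sample.
            intervals.append((start, last))
            start = None
    if start is not None:
        # The series ended while still above the threshold.
        intervals.append((start, last))
    return intervals


# Made-up sample data, one point per minute (not real measurements).
t0 = datetime(2022, 1, 1, 0, 0)
values = [100, 2500, 3000, 1500, 2100, 2200]
samples = [(t0 + timedelta(minutes=i), v) for i, v in enumerate(values)]

for begin, end in spike_intervals(samples):
    print(f"- {begin:%Y-%m-%d %H:%M} UTC - {end:%H:%M} UTC")
```

The interval end is the last sample seen above the threshold, so the reported precision depends on the scrape interval of the metric.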
For the last few months, the "Postgres pending WAL files on primary is high" alert has been firing repeatedly, for example:
- production#6518 (closed)
- production#6468 (closed)
- production#6463 (closed)
- production#6414 (closed)
- production#5937 (closed)
(See the issues linked above for more details on previous investigations.)
The purpose of this issue is to:
- finish the investigation into what is causing the backlog
- take the steps necessary to prevent the backlog from happening again
Concise background summary
See these notes if you want a quick primer on the pathology, why it matters, and a categorical framing of the solution space:
https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15362#note_877494437
Status summary as of 2022-03-21
We are bumping the alerting threshold from 3k to 5k: gitlab-com/runbooks!4453 (merged).
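As a rough illustration of what the threshold applies to: PostgreSQL marks each WAL segment awaiting archiving with a `.ready` file in `pg_wal/archive_status`, and renames it to `.done` once archived. The sketch below assumes the pending count is simply the number of `.ready` files and that the alert fires when the count exceeds the threshold; the exact query behind `pg_archiver_pending_wal_count` and the alert expression are not shown in this issue, and the function names and the data directory path are hypothetical.

```python
from pathlib import Path


def count_pending_wal(archive_status_dir):
    """Count WAL segments flagged '.ready', i.e. still waiting to be archived."""
    return sum(
        1
        for entry in Path(archive_status_dir).iterdir()
        if entry.name.endswith(".ready")
    )


def should_alert(pending, threshold=5000):
    """Assumed alert condition: pending count strictly above the threshold."""
    return pending > threshold


if __name__ == "__main__":
    # Hypothetical data directory; adjust to the actual PGDATA location.
    status_dir = Path("/var/lib/postgresql/data/pg_wal/archive_status")
    if status_dir.is_dir():
        pending = count_pending_wal(status_dir)
        print(f"pending WAL segments: {pending}, alert: {should_alert(pending)}")
```

Reading the directory requires filesystem access as the postgres OS user, so this is only useful for ad hoc checks on the host.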
We will open some follow-up issues for further avenues of investigation. Getting debug symbols for WAL-G for better profiling is one of them.
Edited by Igor