Postgres pending WAL files on primary is high

Summary of recent spikes where pg_archiver_pending_wal_count > 2000:

  • 2022-03-14, 13:28 - 18:28 UTC
  • 2022-03-14, 19:48 - 20:35 UTC
  • 2022-03-15, 12:07 - 21:32 UTC

https://thanos.gitlab.net/graph?g0.expr=pg_archiver_pending_wal_count%7Benvironment%3D%22gprd%22%2Ctype%3D%22patroni%22%7D%20%3E%202000&g0.tab=0&g0.stacked=0&g0.range_input=2d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
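
Windows like the ones listed above could be derived from raw samples of the metric roughly as follows. This is only a minimal sketch, not the tooling used for this issue: the helper `spike_windows`, its `max_gap` tolerance, and the `(timestamp, value)` sample format are hypothetical. Only the PromQL expression is taken (URL-decoded) from the Thanos link above.

```python
from datetime import datetime, timedelta, timezone

# PromQL expression, URL-decoded from the Thanos link above.
EXPR = 'pg_archiver_pending_wal_count{environment="gprd",type="patroni"} > 2000'


def spike_windows(samples, threshold=2000, max_gap=timedelta(minutes=5)):
    """Group samples above `threshold` into contiguous (start, end) windows.

    samples: iterable of (datetime, value) pairs, sorted by time.
    max_gap: hypothetical tolerance; a longer silence between two
             above-threshold samples starts a new window.
    """
    windows = []
    start = prev = None
    for ts, value in samples:
        if value > threshold:
            if start is None:
                start = ts                      # a new spike begins
            elif ts - prev > max_gap:
                windows.append((start, prev))   # gap too long: close window
                start = ts
            prev = ts
        elif start is not None:
            windows.append((start, prev))       # dropped back below threshold
            start = prev = None
    if start is not None:
        windows.append((start, prev))           # spike still open at the end
    return windows
```

Feeding it the samples behind the graph (however they are exported) would yield one `(start, end)` pair per spike. The 2000 default mirrors the query, not the alerting threshold.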

For the last few months, the "Postgres pending WAL files on primary is high" alert has been firing repeatedly (see the issues linked above for more details on previous investigations).

The purpose of this issue is to:

  • finish the investigation into what is causing the backlog
  • take the steps necessary to prevent the backlog from recurring

Concise background summary

See these notes if you want a quick primer on the pathology, why it matters, and a categorical framing of the solution space:

https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15362#note_877494437

Status summary as of 2022-03-21

We are raising the alerting threshold from 3k to 5k pending WAL files: gitlab-com/runbooks!4453 (merged).

We will open follow-up issues for further avenues of investigation. One of them is getting debug symbols for WAL-G to enable better profiling.

Edited by Igor