Analyze the saturation risks for Postgres processes on Main and CI databases (walreceiver, walsender, archiver, etc.)
The idea of this issue is to proactively analyze the risks of saturation of Postgres processes in the large gprd databases (Main, CI): to see how far we are from saturation points, how much room for growth we have in the next few years, and how to improve. This work should help us handle growing workload and data volumes better and avoid incidents.
Postgres processes are single-threaded, and at our scale the saturation of a single vCPU can become an issue (similar to what we've observed in the past for PgBouncer, walsender, and the logical replication worker).
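To make "how far from saturation" concrete per process, one option is to sample per-process CPU usage directly on the database hosts. Below is a minimal sketch (assuming Python with `psutil`, which is not something we necessarily ship today; the watched process list is illustrative) that reports how close each single-threaded worker is to consuming a full vCPU:

```python
#!/usr/bin/env python3
"""Sample per-process CPU usage of Postgres workers on a database host.

Because these processes are single-threaded, a reading close to 100%
for any of them means a full vCPU is saturated.
"""
import time
import psutil

# Illustrative list; extend with any other process titles of interest.
WATCHED = ("walsender", "walreceiver", "archiver", "checkpointer",
           "background writer", "logical replication")

def postgres_workers():
    """Yield (process, title) for the Postgres workers we want to watch."""
    for proc in psutil.process_iter(["pid", "cmdline"]):
        title = " ".join(proc.info["cmdline"] or [])
        if title.startswith("postgres:") and any(w in title for w in WATCHED):
            yield proc, title

procs = list(postgres_workers())

# The first cpu_percent() call primes the counters; the second call
# (after a pause) returns utilization over that window.
for proc, _ in procs:
    proc.cpu_percent(None)
time.sleep(5)

for proc, title in procs:
    try:
        print(f"{proc.cpu_percent(None):6.1f}%  {title}")
    except psutil.NoSuchProcess:
        pass  # the worker exited during the sampling window
```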
## Examples

A couple of examples involving critically important processes:
- Recent `walreceiver` saturation causing replication lag on gprd-main replicas: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2597
- Backups (WAL archiving): the gprd-main primary experiences spikes of pending WALs. WAL-G has logic to parallelize WAL archiving (if more than 1 WAL is pending, it attempts to archive min(16, N+1) WALs in parallel), but we still observe spikes; this requires further analysis.
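For the WAL archiving example, a quick way to see the current backlog and the parallelism WAL-G would apply (per the min(16, N+1) behaviour described above) is to count the `.ready` files in `pg_wal/archive_status`. A minimal sketch, assuming read access to the data directory on the primary; the `PGDATA` path below is an assumption:

```python
#!/usr/bin/env python3
"""Count pending WAL segments and estimate WAL-G's upload concurrency."""
import pathlib

PGDATA = pathlib.Path("/var/opt/gitlab/postgresql/data")  # assumed path, adjust
archive_status = PGDATA / "pg_wal" / "archive_status"

# Every *.ready file is a WAL segment still waiting to be archived.
pending = sum(1 for p in archive_status.iterdir() if p.suffix == ".ready")

# Parallelism as described above: min(16, N + 1).
concurrency = min(16, pending + 1)

print(f"pending WAL segments: {pending}")
print(f"expected WAL-G upload concurrency: {concurrency}")
```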
## Holistic analysis
Processes to analyze (draft, WIP):
| Component | Processes | Risks | Current state | Optimization ideas | Saturation analysis present in Tamland |
|---|---|---|---|---|---|
| Replication | `walsender`, `walreceiver` (+ logical replication worker, temporarily) | Lagging replicas | `walreceiver`: high single-vCPU load on gprd-main replicas | Main idea: reduce WAL generation rates (optimize queries, drop unused/redundant indexes, tune checkpoints; partition large tables to achieve better data locality) | no |
| Backups | `archiver`, WAL-G processes | WAL archiving lagging on the gprd-main primary, affecting RPO | Occasional spikes to ~100 pending WALs | TBD | no |
| Checkpointing | `checkpointer`, `background writer` | Overlapping checkpoints, risks of increased recovery time after crashes | | | no |
| Connection pooling | PgBouncer workers | Partial inability to process workload | | | yes |
| Vacuuming | autovacuum workers | 1) Slower vacuuming → potentially higher bloat, degraded performance. 2) Lagging table statistics (ANALYZE) → wrong plan choices, degraded performance. 3) Slower freezing → higher risks of XID/MXID wraparound | | | |
| Stats collection | `stats collector` | Irrelevant cumulative stats (NOT used by the query planner) | Moderate level of vCPU usage (details TBD) | | |
| Logging | `logger` | | CPU usage is very low | | no |
| Extensions | `pg_wait_sampling` collector | | CPU usage is very low | | no |
| Index creation | regular Postgres backend | Long index build time blocking autovacuum work | | Partitioning | no |
| WAL writer | `walwriter` | TBD | TBD | | no |
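To ground the "Current state" column above in numbers, a small script can pull the archiver counters and estimate the WAL generation rate that replication and archiving ultimately have to keep up with. A minimal sketch, assuming Python with `psycopg2` and a read-only connection; the DSN below is a placeholder:

```python
#!/usr/bin/env python3
"""Check archiver health and estimate the current WAL generation rate."""
import time
import psycopg2

DSN = "host=localhost dbname=postgres"  # placeholder; point at the target node

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    # Archiver health: counts plus the last archived/failed segments.
    cur.execute("""
        SELECT archived_count, last_archived_wal, last_archived_time,
               failed_count, last_failed_wal
        FROM pg_stat_archiver
    """)
    print("pg_stat_archiver:", cur.fetchone())

    # WAL generation rate: bytes of WAL produced over a 10-second window.
    cur.execute("SELECT pg_current_wal_lsn()")
    start_lsn = cur.fetchone()[0]
    time.sleep(10)
    cur.execute("SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), %s)", (start_lsn,))
    wal_bytes = cur.fetchone()[0]
    print(f"WAL generation rate: {wal_bytes / 10 / 1024 / 1024:.1f} MiB/s")
```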