Analyze the saturation risks for Postgres processes on Main and CI databases (walreceiver, walsender, archiver, etc.)
The idea of this issue is to proactively analyze the risks of saturation of Postgres processes in the large gprd databases (Main, CI): to see how far we are from saturation points, how much room for growth we have in the next few years, and how to improve. This work should help us handle growing workload and data volumes better and avoid incidents.
Postgres processes are single-threaded, and at our scale the saturation of a single vCPU can become an issue (similar to what we've observed in the past for PgBouncer, walsender, and the logical replication worker).
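To make "how far from saturation" concrete per process, one option is to sample per-process CPU usage directly on the database hosts. Below is a minimal sketch (assuming Python with `psutil`, which is not something we necessarily ship today; the watched process list is illustrative) that reports how close each single-threaded worker is to consuming a full vCPU:

```python
#!/usr/bin/env python3
"""Sample per-process CPU usage of Postgres workers on a database host.

Because these processes are single-threaded, a reading close to 100%
for any of them means a full vCPU is saturated.
"""
import time
import psutil

# Illustrative list; extend with any other process titles of interest.
WATCHED = ("walsender", "walreceiver", "archiver", "checkpointer",
           "background writer", "logical replication")

def postgres_workers():
    """Yield (process, title) for the Postgres workers we want to watch."""
    for proc in psutil.process_iter(["pid", "cmdline"]):
        title = " ".join(proc.info["cmdline"] or [])
        if title.startswith("postgres:") and any(w in title for w in WATCHED):
            yield proc, title

procs = list(postgres_workers())

# The first cpu_percent() call primes the counters; the second call
# (after a pause) returns utilization over that window.
for proc, _ in procs:
    proc.cpu_percent(None)
time.sleep(5)

for proc, title in procs:
    try:
        print(f"{proc.cpu_percent(None):6.1f}%  {title}")
    except psutil.NoSuchProcess:
        pass  # the worker exited during the sampling window
```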
## Examples

A couple of examples involving critically important processes:
- Recent `walreceiver` saturation causing replication lag on gprd-main replicas: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2597
- Backups (WAL archiving): the gprd-main primary experiences spikes of pending WALs. WAL-G has logic to parallelize WAL archiving (if more than 1 WAL is pending, it attempts to archive min(16, N+1) WALs in parallel), but we still observe spikes; this requires further analysis.
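For the WAL archiving example, a quick way to see the current backlog and the parallelism WAL-G would apply (per the min(16, N+1) behaviour described above) is to count the `.ready` files in `pg_wal/archive_status`. A minimal sketch, assuming read access to the data directory on the primary; the `PGDATA` path below is an assumption:

```python
#!/usr/bin/env python3
"""Count pending WAL segments and estimate WAL-G's upload concurrency."""
import pathlib

PGDATA = pathlib.Path("/var/opt/gitlab/postgresql/data")  # assumed path, adjust
archive_status = PGDATA / "pg_wal" / "archive_status"

# Every *.ready file is a WAL segment still waiting to be archived.
pending = sum(1 for p in archive_status.iterdir() if p.suffix == ".ready")

# Parallelism as described above: min(16, N + 1).
concurrency = min(16, pending + 1)

print(f"pending WAL segments: {pending}")
print(f"expected WAL-G upload concurrency: {concurrency}")
```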
## Holistic analysis
Processes to analyze (draft, WIP):
| Component | Processes | Risks | Current state | Optimization ideas | Saturation analysis present in Tamland |
|---|---|---|---|---|---|
| Replication | `walsender`, `walreceiver` (+ logical replication worker, temporarily) | Lagging replicas | `walreceiver`: high single-vCPU load on gprd-main replicas | Main idea: reduce WAL generation rates (optimize queries, drop unused/redundant indexes, tune checkpoints; partition large tables to achieve better data locality) | no |
| Backups | `archiver`, WAL-G processes | WAL archiving lagging on the gprd-main primary, affecting RPO | Occasional spikes to ~100 pending WALs | TBD | no |
| Checkpointing | `checkpointer`, `background writer` | Overlapping checkpoints, risks of increased recovery time after crashes | | | no |
| Connection pooling | PgBouncer workers | Partial inability to process workload | | | yes |
| Vacuuming | autovacuum workers | 1) Slower vacuuming → potentially higher bloat, degraded performance. 2) Lagging table statistics (ANALYZE) → wrong plan choices, degraded performance. 3) Slower freezing → higher risks of XID/MXID wraparound | | | |
| Stats collection | `stats collector` | Irrelevant cumulative stats (NOT used by the query planner) | Moderate level of vCPU usage (details TBD) | | |
| Logging | `logger` | | CPU usage is very low | | no |
| Extensions | `pg_wait_sampling` collector | | CPU usage is very low | | no |
| Index creation | regular Postgres backend | Long index build time blocking autovacuum work | | Partitioning | no |
| WAL writer | `walwriter` | TBD | TBD | | no |
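To ground the "Current state" column above in numbers, a small script can pull the archiver counters and estimate the WAL generation rate that replication and archiving ultimately have to keep up with. A minimal sketch, assuming Python with `psycopg2` and a read-only connection; the DSN below is a placeholder:

```python
#!/usr/bin/env python3
"""Check archiver health and estimate the current WAL generation rate."""
import time
import psycopg2

DSN = "host=localhost dbname=postgres"  # placeholder; point at the target node

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    # Archiver health: counts plus the last archived/failed segments.
    cur.execute("""
        SELECT archived_count, last_archived_wal, last_archived_time,
               failed_count, last_failed_wal
        FROM pg_stat_archiver
    """)
    print("pg_stat_archiver:", cur.fetchone())

    # WAL generation rate: bytes of WAL produced over a 10-second window.
    cur.execute("SELECT pg_current_wal_lsn()")
    start_lsn = cur.fetchone()[0]
    time.sleep(10)
    cur.execute("SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), %s)", (start_lsn,))
    wal_bytes = cur.fetchone()[0]
    print(f"WAL generation rate: {wal_bytes / 10 / 1024 / 1024:.1f} MiB/s")
```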