Pause migration when WAL queue pending archival crossed a threshold
Extracted from #353395 (comment 892991285)
Overview
- Indicator: Number of WAL segments pending archival
- Source: Query primary (see below)
- Action: Pause migration if pending wal queue length is greater then configurable threshold
- Parameter: threshold
- Needs prometheus: No
WAL archival is crucial to database recovery objectives and potentially also availability upon failover scenarios (depending on implementation).
Once a certain threshold (configurable, defaults to what's necessary for GitLab.com) is crossed, we stop data migrations to let the system recover and work on the backlog of WAL segments before the migration resumes.
The number of WAL segments pending archival can be retrieve with below query or through prometheus.
SQL query:
WITH
current_wal_file AS (
SELECT CASE WHEN NOT pg_is_in_recovery() THEN pg_xlogfile_name(pg_current_xlog_insert_location()) ELSE NULL END pg_xlogfile_name
),
current_wal AS (
SELECT
('x'||substring(pg_xlogfile_name,9,8))::bit(32)::int log,
('x'||substring(pg_xlogfile_name,17,8))::bit(32)::int seg,
pg_xlogfile_name
FROM current_wal_file
),
archive_wal AS(
SELECT
('x'||substring(last_archived_wal,9,8))::bit(32)::int log,
('x'||substring(last_archived_wal,17,8))::bit(32)::int seg,
last_archived_wal
FROM pg_stat_archiver
)
SELECT coalesce(((cw.log - aw.log) * 256) + (cw.seg-aw.seg),'NaN'::float) as pending_wal_count FROM current_wal cw, archive_wal aw
Prometheus query:
pg_archiver_pending_wal_count{env="gprd"} and on(fqdn) (pg_replication_is_replica{type=~"patroni|patroni-registry|patroni-ci"} == 0)
Edited by Andreas Brandl