Possible corruption on gprd-registry Postgres standby nodes – check and mitigate if any
According to the findings in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/15925, there is a chance that gprd-registry replicas have corruption in certain table/index files. The good news is that we haven't had a switchover/failover yet:
patroni-registry-v14-01-db-gprd.c.gitlab-production.internal:~$ sudo gitlab-patronictl list
+--------------------------------------------------------------+---------------+---------+---------+----+-----------+---------------------+
| Member | Host | Role | State | TL | Lag in MB | Tags |
+ Cluster: gprd-pg14-patroni-registry (7243063938146092911) ---+---------------+---------+---------+----+-----------+---------------------+
| patroni-registry-v14-01-db-gprd.c.gitlab-production.internal | 10.220.27.101 | Leader | running | 2 | | |
+--------------------------------------------------------------+---------------+---------+---------+----+-----------+---------------------+
| patroni-registry-v14-02-db-gprd.c.gitlab-production.internal | 10.220.27.102 | Replica | running | 2 | 0 | |
+--------------------------------------------------------------+---------------+---------+---------+----+-----------+---------------------+
| patroni-registry-v14-03-db-gprd.c.gitlab-production.internal | 10.220.27.103 | Replica | running | 2 | 4 | nofailover: true |
| | | | | | | noloadbalance: true |
+--------------------------------------------------------------+---------------+---------+---------+----+-----------+---------------------+
So if there is any standby corruption, it could lead to occasional XX001 / XX002 errors in the Postgres logs on the standby nodes. The good news here is that so far we haven't seen such errors (likely because all or the majority of the traffic goes to the primary); still, the logs should be checked on all three gprd-registry nodes.
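A quick way to look for this on each node is to grep the recent Postgres logs; a rough sketch follows. The log directory is an assumption (point it at wherever the Postgres logs actually live on these nodes), and SQLSTATE codes such as XX001/XX002 only appear verbatim if the log format includes them (csvlog, or %e in log_line_prefix), so the sketch also matches typical corruption message texts:
# Rough sweep of recent Postgres logs for signs of corruption (run on each node).
# NOTE: LOGDIR is an assumption; adjust to the actual Postgres log directory.
LOGDIR=/var/log/postgresql

sudo grep -rniE 'XX001|XX002|invalid page in block|could not read block|unexpected chunk' \
  "$LOGDIR" | tail -n 50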
The idea of this check is:
- run pg_amcheck with --heapallindexed on all gprd-registry nodes, to check both indexes and heap
- if corruption on standby nodes is confirmed, consider the following:
  - temporarily disable autofailover in Patroni, so that a corrupted standby node cannot become the primary if a failover happens (perhaps by issuing pause in Patroni; see the command sketch after this list)
  - rebuild the standby nodes as soon as possible, before re-enabling autofailover
  - re-enable autofailover
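For reference, pausing and resuming automatic failover with Patroni would look roughly like this (a sketch only, using the cluster name from the gitlab-patronictl output above; double-check against the runbooks before running it):
# Put the cluster into maintenance mode: Patroni stops performing automatic failovers.
sudo gitlab-patronictl pause gprd-pg14-patroni-registry

# ... rebuild the affected standby nodes here ...

# Re-enable automatic failover once the standbys are healthy again.
sudo gitlab-patronictl resume gprd-pg14-patroni-registry

# Confirm the state; the "Maintenance mode: on" banner disappears after resume.
sudo gitlab-patronictl list gprd-pg14-patroni-registry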
pg_amcheck --jobs=$N --heapallindexed will acquire long-lasting locks, but it is not invasive: its impact is comparable to running pg_dump in multiple threads, or to multiple SELECTs from large tables. From the docs:
bt_index_check acquires an AccessShareLock on the target index and the heap relation it belongs to. This lock mode is the same lock mode acquired on relations by simple SELECT statements. bt_index_check does not verify invariants that span child/parent relationships, but will verify the presence of all heap tuples as index tuples within the index when heapallindexed is true. When a routine, lightweight test for corruption is required in a live production environment, using bt_index_check often provides the best trade-off between thoroughness of verification and limiting the impact on application performance and availability.
The additional heapallindexed phase adds significant overhead: verification will typically take several times longer. However, there is no change to the relation-level locks acquired when heapallindexed verification is performed.
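For reference, the same heapallindexed verification can be run against a single index via the amcheck SQL functions, which may be handy for re-checking anything that pg_amcheck flags. A sketch, with a placeholder index name and the same connection parameters as the pg_amcheck snippet below:
export PGPASSWORD=...
psql -h localhost -U gitlab-superuser -d gitlabhq_registry <<'SQL'
CREATE EXTENSION IF NOT EXISTS amcheck;
-- AccessShareLock only; heapallindexed => true also verifies that every heap tuple
-- has a matching index entry (the slower, more thorough mode described above).
-- 'some_index_name' is a placeholder: substitute the index to re-check.
SELECT bt_index_check('some_index_name'::regclass, heapallindexed => true);
SQL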
How to check
Snippet to check (using 4 threads; UPDATE: 8 threads proved to be too active on gprd-registry, producing too much read throughput on disk, at the level of 500-600 MiB/s with spikes to 1-1.2 GiB/s, which is our limit, so it's better to use just 4 threads):
export PGPASSWORD=...
export DBNAME=gitlabhq_registry # for ci and main: gitlabhq_production
/usr/lib/postgresql/14/bin/pg_amcheck \
  --jobs=4 \
  --heapallindexed \
  -U gitlab-superuser $DBNAME -h localhost \
  2>&1 \
  | ts | tee /tmp/$(date +%Y%m%d_%H%M%S).$(hostname).amcheck.log
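After a run completes, skim the saved log for findings (the pattern below is a rough filter based on typical amcheck wording, not an exhaustive check; pg_amcheck's exit status is also worth looking at):
# Rough filter over the saved log(s); the matched wording is an assumption based on
# typical amcheck messages, so also review the full file before declaring a node clean.
grep -EiH 'corrupt|invalid|could not|failed|error' /tmp/*.amcheck.log \
  || echo "no suspicious lines found"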