Fix CNPG WAL issues
What does this MR do and why?
This MR should fix the CNPG WAL issues described in various issues like #2357 (closed), #2648 (closed), #2744 (closed)
Initial analysis and testing have shown that this WAL accumulation occurs when streaming replication is blocked for some reason. In such circumstances, the primary server fills its storage at a high rate.
We've found that increasing checkpoint_timeout (!5360 (merged)) and archive_timeout (!5922 (merged)) helped reduce the frequency of WAL checkpoints and the storage filling rate (see this comment)
These changes do not, however, prevent the PVC from filling up indefinitely (at a slower rate) until an operator fixes the failed replicas.
While re-reading the CNPG documentation about replication strategies, I found this chapter, which describes exactly the situation we are observing and explains that max_slot_wal_keep_size should be set in such cases to prevent WAL from accumulating indefinitely.
This is the first commit of this MR: it sets max_slot_wal_keep_size to a value slightly smaller than the PVC size.
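For illustration, the setting lands in the PostgreSQL parameters of the CNPG Cluster resource, roughly as sketched below (the values shown are illustrative, not the exact ones used by this MR):

```yaml
# Illustrative sketch only: max_slot_wal_keep_size is set slightly below the
# WAL PVC size so that a stalled replication slot can no longer fill the volume.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: keycloak-postgresql
spec:
  instances: 3
  storage:
    size: 2Gi                         # PVC size (illustrative value)
  postgresql:
    parameters:
      max_slot_wal_keep_size: 1800MB  # slightly smaller than the PVC size
```

With this limit in place, PostgreSQL invalidates any slot that would retain more WAL than allowed instead of letting the primary run out of disk.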
This strategy has a side effect, however: it may lead some replicas to lag behind the last checkpoint and remain stuck waiting for deleted segments (something we had already been observing even before setting max_slot_wal_keep_size).
In such cases we can observe logs like:
ERROR: requested WAL segment xxxxxxxxxxxxxxxxxx has already been removed
or
could not start WAL streaming: ERROR: can no longer access replication slot "_cnpg_keycloak_postgresql_9"
DETAIL: This replication slot has been invalidated due to "wal_removed"
But the CNPG cluster and its replicas can still report a Ready status in such cases (at least that's what I observed with the test procedure described below), which may explain why we were silently ignoring these issues in our deployments.
In order to detect and break such deadlocks, this MR introduces a cronjob that checks pg_replication_slots on the primary node and deletes any replica (and its PVC) that is waiting for removed WAL segments (a sketch of the check is shown after the example output below):
❯ k exec -it -n keycloak keycloak-postgresql-10 -- psql -c 'SELECT slot_name, restart_lsn, wal_status, invalidation_reason FROM pg_replication_slots;'
Defaulted container "postgres" out of: postgres, bootstrap-controller (init)
slot_name | restart_lsn | wal_status | invalidation_reason
-----------------------------+-------------+------------+---------------------
_cnpg_keycloak_postgresql_9 | | lost | wal_removed # <<< this replica is Ready but will never recover
_cnpg_keycloak_postgresql_4 | 19/3C000060 | reserved |
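The check itself boils down to something like the sketch below (a simplified illustration of the cronjob's logic, not the exact script shipped in this MR; it relies on CNPG's `_cnpg_<cluster>_<serial>` slot naming and on the usual CNPG pod labels):

```bash
#!/usr/bin/env bash
# Simplified sketch: find replication slots invalidated by wal_removed on the
# primary and recreate the corresponding replica (pod + PVC).
set -euo pipefail

NAMESPACE=keycloak
PRIMARY=$(kubectl get pod -n "$NAMESPACE" -l cnpg.io/instanceRole=primary \
  -o jsonpath='{.items[0].metadata.name}')

LOST_SLOTS=$(kubectl exec -n "$NAMESPACE" "$PRIMARY" -c postgres -- \
  psql -At -c "SELECT slot_name FROM pg_replication_slots WHERE wal_status = 'lost';")

for slot in $LOST_SLOTS; do
  # "_cnpg_keycloak_postgresql_9" -> "keycloak-postgresql-9"
  instance=${slot#_cnpg_}
  instance=${instance//_/-}
  echo "Replica $instance is stuck on removed WAL segments, recreating it"
  kubectl delete pvc -n "$NAMESPACE" "$instance" --ignore-not-found --wait=false
  kubectl delete pod -n "$NAMESPACE" "$instance" --ignore-not-found
done
```

Once the pod and its PVC are gone, the CNPG operator re-creates the instance from a fresh copy of the primary, which clears the stale slot.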
In a third commit, the anti-affinity policy is set to required only when there are at least 4 nodes in the cluster; otherwise it may prevent a replica from being scheduled during rolling upgrades when only two nodes are available, which can cause a replica to fall out of sync.
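In CNPG terms this amounts to switching the cluster's podAntiAffinityType between preferred and required depending on the node count, roughly as sketched below (how the chart computes the node count is omitted here):

```yaml
# Sketch only: hard anti-affinity is enforced only when there are enough nodes
# to reschedule every replica (>= 4 in this MR), otherwise it stays "preferred".
spec:
  affinity:
    enablePodAntiAffinity: true
    topologyKey: kubernetes.io/hostname
    podAntiAffinityType: required   # "preferred" when fewer than 4 nodes
```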
As an alternative, we could also consider using quorum-based synchronous replication instead of the current streaming replication strategy; it could be better suited to our use case, where some replicas may become unavailable.
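For reference, a hedged sketch of what that could look like with CNPG's quorum-based synchronous replication (exact field names depend on the CNPG version; recent releases also expose a dedicated `synchronous` stanza):

```yaml
# Sketch only: require at least one synchronous standby so commits survive the
# loss of the primary, while still tolerating one missing replica.
spec:
  instances: 3
  minSyncReplicas: 1
  maxSyncReplicas: 1
```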
Related reference(s)
Closes: #2648 (closed)
Test coverage
I've been able to reproduce and test these conditions on my platform:
- I changed the cluster checkpoint_timeout and archive_timeout back to 5 min to increase the WAL fill rate
- Then applied a NetworkPolicy to isolate a replica:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: deny-all-traffic
namespace: keycloak
spec:
podSelector:
matchLabels:
cnpg.io/instanceName: keycloak-postgresql-3
policyTypes:
- Ingress
- Egress
- After some time, the primary PVC starts to fill until it reaches max_slot_wal_keep_size, then it shrinks back to ~wal_keep_size (600MB); WAL usage can be watched as sketched below
- Once the NetworkPolicy is removed, we see that the replica starts complaining about missing WAL segments
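During the test, WAL usage and slot state can be observed on the primary with something like the following (reusing the primary pod from the example above):

```bash
# Sketch: report WAL directory size and replication slot health on the primary
kubectl exec -n keycloak keycloak-postgresql-10 -c postgres -- psql -c \
  "SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();
   SELECT slot_name, wal_status, safe_wal_size FROM pg_replication_slots;"
```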
CI configuration
Below you can choose test deployment variants to run in this MR's CI.
Click to open the CI configuration
Legend:
| Icon | Meaning | Available values |
|---|---|---|
| ☁️ | Infra Provider | capd, capo, capm3 |
| 🚀 | Bootstrap Provider | kubeadm (alias kadm), rke2, okd, ck8s |
| 🐧 | Node OS | ubuntu, suse, na, leapmicro |
| 🛠️ | Deployment Options | light-deploy, dev-sources, ha, misc, maxsurge-0, logging, no-logging, cilium |
| 🎬 | Pipeline Scenarios | Available scenario list and description |
| 🟢 | Enabled units | Any available unit name, by default applied to both management and workload cluster. Can be prefixed with mgmt: or wkld: to apply only to a specific cluster type |
| | Target platform | Can be used to select a specific deployment environment (i.e. real-bmh for capm3) |
- 🎬 preview ☁️ capd 🚀 kadm 🐧 ubuntu
- 🎬 preview ☁️ capo 🚀 rke2 🐧 suse
- 🎬 preview ☁️ capm3 🚀 rke2 🐧 ubuntu
- ☁️ capd 🚀 kadm 🛠️ light-deploy 🐧 ubuntu
- ☁️ capd 🚀 rke2 🛠️ light-deploy 🐧 suse
- ☁️ capo 🚀 rke2 🐧 suse
- ☁️ capo 🚀 rke2 🐧 leapmicro
- ☁️ capo 🚀 kadm 🐧 ubuntu
- ☁️ capo 🚀 kadm 🐧 ubuntu 🟢 neuvector,mgmt:harbor
- ☁️ capo 🚀 rke2 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capo 🚀 kadm 🎬 wkld-k8s-upgrade 🐧 ubuntu
- ☁️ capo 🚀 rke2 🎬 rolling-update-no-wkld 🛠️ ha 🐧 suse
- ☁️ capo 🚀 rke2 🎬 sylva-upgrade 🛠️ ha 🐧 ubuntu
- ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.6.x 🛠️ ha,misc 🐧 ubuntu
- ☁️ capo 🚀 rke2 🛠️ ha,misc 🐧 ubuntu
- ☁️ capo 🚀 rke2 🛠️ ha,misc,openbao 🐧 suse
- ☁️ capo 🚀 rke2 🐧 suse 🎬 upgrade-from-prev-tag
- ☁️ capm3 🚀 rke2 🐧 suse
- ☁️ capm3 🚀 kadm 🐧 ubuntu
- ☁️ capm3 🚀 ck8s 🐧 ubuntu
- ☁️ capm3 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha,misc 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🎬 wkld-k8s-upgrade 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🎬 upgrade-from-prev-release-branch 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 rke2 🛠️ misc,ha 🐧 suse
- ☁️ capm3 🚀 rke2 🎬 sylva-upgrade 🛠️ ha,misc 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 ck8s 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2|okd 🎬 no-update 🐧 ubuntu|na
- ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-from-release-1.5
- ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-to-main
Global config for deployment pipelines
- autorun pipelines
- allow failure on pipelines
- record sylvactl events
Notes:
- Enabling autorun will make deployment pipelines run automatically without human interaction
- Disabling allow failure will make deployment pipelines mandatory for pipeline success
- If both autorun and allow failure are disabled, deployment pipelines will need manual triggering but will block the pipeline
Be aware: after a configuration change, the pipeline is not triggered automatically.
Please run it manually (by clicking the "Run pipeline" button in the Pipelines tab) or push new code.