Fix CNPG WAL issues
What does this MR do and why?
This MR should fix the CNPG WAL issues described in various issues like #2357 (closed), #2648 (closed), #2744 (closed)
Initial analysis and testing have shown that this WAL accumulation occurs when streaming replication is blocked for some reason. In such circumstances, the primary server fills its storage at a high rate.
We've found that increasing checkpoint_timeout (!5360 (merged)) and archive_timeout (!5922 (merged)) helped reduce the frequency of WAL checkpoints and the storage filling rate (see this comment)
These changes do not, however, prevent the PVC from filling up indefinitely (at a slower rate) until an operator fixes the failed replicas.
While re-reading the CNPG documentation about replication strategies, I found this chapter, which describes exactly the situation we are observing and explains that max_slot_wal_keep_size should be set in such cases to prevent WAL from accumulating indefinitely.
This is the first commit of this MR: it sets max_slot_wal_keep_size to a value slightly smaller than the PVC size.
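For illustration, the setting lands in the PostgreSQL parameters of the CNPG Cluster resource, roughly as sketched below (the values shown are illustrative, not the exact ones used by this MR):

```yaml
# Illustrative sketch only: max_slot_wal_keep_size is set slightly below the
# WAL PVC size so that a stalled replication slot can no longer fill the volume.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: keycloak-postgresql
spec:
  instances: 3
  storage:
    size: 2Gi                         # PVC size (illustrative value)
  postgresql:
    parameters:
      max_slot_wal_keep_size: 1800MB  # slightly smaller than the PVC size
```

With this limit in place, PostgreSQL invalidates any slot that would retain more WAL than allowed instead of letting the primary run out of disk.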
This strategy has a side effect, however: it may lead some replicas to lag behind the last checkpoint and remain stuck waiting for deleted segments (something we had already been observing even before setting max_slot_wal_keep_size).
In such cases we can observe logs like:
ERROR: requested WAL segment xxxxxxxxxxxxxxxxxx has already been removed
or
could not start WAL streaming: ERROR: can no longer access replication slot "_cnpg_keycloak_postgresql_9"
DETAIL: This replication slot has been invalidated due to "wal_removed"
But the CNPG cluster and its replicas can still report a Ready status in such cases (at least that's what I observed with the test procedure described below), which may explain why we were silently ignoring these issues in our deployments.
In order to detect and break such deadlocks, this MR introduces a cronjob that checks pg_replication_slots on the primary node and deletes any replica (and its PVC) that is waiting for removed WAL segments (a sketch of the check is shown after the example output below):
❯ k exec -it -n keycloak keycloak-postgresql-10 -- psql -c 'SELECT slot_name, restart_lsn, wal_status, invalidation_reason FROM pg_replication_slots;'
Defaulted container "postgres" out of: postgres, bootstrap-controller (init)
slot_name | restart_lsn | wal_status | invalidation_reason
-----------------------------+-------------+------------+---------------------
_cnpg_keycloak_postgresql_9 | | lost | wal_removed # <<< this replica is Ready but will never recover
_cnpg_keycloak_postgresql_4 | 19/3C000060 | reserved |
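The check itself boils down to something like the sketch below (a simplified illustration of the cronjob's logic, not the exact script shipped in this MR; it relies on CNPG's `_cnpg_<cluster>_<serial>` slot naming and on the usual CNPG pod labels):

```bash
#!/usr/bin/env bash
# Simplified sketch: find replication slots invalidated by wal_removed on the
# primary and recreate the corresponding replica (pod + PVC).
set -euo pipefail

NAMESPACE=keycloak
PRIMARY=$(kubectl get pod -n "$NAMESPACE" -l cnpg.io/instanceRole=primary \
  -o jsonpath='{.items[0].metadata.name}')

LOST_SLOTS=$(kubectl exec -n "$NAMESPACE" "$PRIMARY" -c postgres -- \
  psql -At -c "SELECT slot_name FROM pg_replication_slots WHERE wal_status = 'lost';")

for slot in $LOST_SLOTS; do
  # "_cnpg_keycloak_postgresql_9" -> "keycloak-postgresql-9"
  instance=${slot#_cnpg_}
  instance=${instance//_/-}
  echo "Replica $instance is stuck on removed WAL segments, recreating it"
  kubectl delete pvc -n "$NAMESPACE" "$instance" --ignore-not-found --wait=false
  kubectl delete pod -n "$NAMESPACE" "$instance" --ignore-not-found
done
```

Once the pod and its PVC are gone, the CNPG operator re-creates the instance from a fresh copy of the primary, which clears the stale slot.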
In a third commit, the anti-affinity policy is set to required only when there are at least 4 nodes in the cluster; otherwise it may prevent a replica from being scheduled during rolling upgrades when only two nodes are available, which can cause a replica to fall out of sync.
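In CNPG terms this amounts to switching the cluster's podAntiAffinityType between preferred and required depending on the node count, roughly as sketched below (how the chart computes the node count is omitted here):

```yaml
# Sketch only: hard anti-affinity is enforced only when there are enough nodes
# to reschedule every replica (>= 4 in this MR), otherwise it stays "preferred".
spec:
  affinity:
    enablePodAntiAffinity: true
    topologyKey: kubernetes.io/hostname
    podAntiAffinityType: required   # "preferred" when fewer than 4 nodes
```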
As an alternative, we could also consider using quorum-based synchronous replication instead of the current streaming replication strategy; it could be better suited to our use case, where some replicas may become unavailable.
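For reference, a hedged sketch of what that could look like with CNPG's quorum-based synchronous replication (exact field names depend on the CNPG version; recent releases also expose a dedicated `synchronous` stanza):

```yaml
# Sketch only: require at least one synchronous standby so commits survive the
# loss of the primary, while still tolerating one missing replica.
spec:
  instances: 3
  minSyncReplicas: 1
  maxSyncReplicas: 1
```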
Related reference(s)
Closes: #2648 (closed)
Test coverage
I've been able to reproduce and test these conditions on my platform:
- I changed the cluster checkpoint_timeout and archive_timeout back to 5 min to increase the WAL fill rate
- Then applied a NetworkPolicy to isolate a replica:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: deny-all-traffic
namespace: keycloak
spec:
podSelector:
matchLabels:
cnpg.io/instanceName: keycloak-postgresql-3
policyTypes:
- Ingress
- Egress
- After some time, the primary PVC starts to fill until it reaches max_slot_wal_keep_size, then it shrinks back to ~wal_keep_size (600MB); WAL usage can be watched as sketched below
- Once the NetworkPolicy is removed, we see that the replica starts complaining about missing WAL segments
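During the test, WAL usage and slot state can be observed on the primary with something like the following (reusing the primary pod from the example above):

```bash
# Sketch: report WAL directory size and replication slot health on the primary
kubectl exec -n keycloak keycloak-postgresql-10 -c postgres -- psql -c \
  "SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();
   SELECT slot_name, wal_status, safe_wal_size FROM pg_replication_slots;"
```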
CI configuration
Below you can choose test deployment variants to run in this MR's CI.
Click to open the CI configuration
Legend:
| Icon | Meaning | Available values |
|---|---|---|
| ☁️ | Infra Provider | capd, capo, capm3 |
| 🚀 | Bootstrap Provider | kubeadm (alias kadm), rke2, okd, ck8s |
| 🐧 | Node OS | ubuntu, suse, na, leapmicro |
| 🛠️ | Deployment Options | light-deploy, dev-sources, ha, misc, maxsurge-0, logging, no-logging, cilium |
| 🎬 | Pipeline Scenarios | Available scenario list and description |
| 🟢 | Enabled units | Any available unit name, by default applied to both management and workload cluster. Can be prefixed with mgmt: or wkld: to apply only to a specific cluster type |
| | Target platform | Can be used to select a specific deployment environment (i.e. real-bmh for capm3) |
- 🎬 preview ☁️ capd 🚀 kadm 🐧 ubuntu
- 🎬 preview ☁️ capo 🚀 rke2 🐧 suse
- 🎬 preview ☁️ capm3 🚀 rke2 🐧 ubuntu
- ☁️ capd 🚀 kadm 🛠️ light-deploy 🐧 ubuntu
- ☁️ capd 🚀 rke2 🛠️ light-deploy 🐧 suse
- ☁️ capo 🚀 rke2 🐧 suse
- ☁️ capo 🚀 rke2 🐧 leapmicro
- ☁️ capo 🚀 kadm 🐧 ubuntu
- ☁️ capo 🚀 kadm 🐧 ubuntu 🟢 neuvector,mgmt:harbor
- ☁️ capo 🚀 rke2 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capo 🚀 kadm 🎬 wkld-k8s-upgrade 🐧 ubuntu
- ☁️ capo 🚀 rke2 🎬 rolling-update-no-wkld 🛠️ ha 🐧 suse
- ☁️ capo 🚀 rke2 🎬 sylva-upgrade 🛠️ ha 🐧 ubuntu
- ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.6.x 🛠️ ha,misc 🐧 ubuntu
- ☁️ capo 🚀 rke2 🛠️ ha,misc 🐧 ubuntu
- ☁️ capo 🚀 rke2 🛠️ ha,misc,openbao 🐧 suse
- ☁️ capo 🚀 rke2 🐧 suse 🎬 upgrade-from-prev-tag
- ☁️ capm3 🚀 rke2 🐧 suse
- ☁️ capm3 🚀 kadm 🐧 ubuntu
- ☁️ capm3 🚀 ck8s 🐧 ubuntu
- ☁️ capm3 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha,misc 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🎬 wkld-k8s-upgrade 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🎬 upgrade-from-prev-release-branch 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 rke2 🛠️ misc,ha 🐧 suse
- ☁️ capm3 🚀 rke2 🎬 sylva-upgrade 🛠️ ha,misc 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 ck8s 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2|okd 🎬 no-update 🐧 ubuntu|na
- ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-from-release-1.5
- ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-to-main
Global config for deployment pipelines
- autorun pipelines
- allow failure on pipelines
- record sylvactl events
Notes:
- Enabling autorun will make deployment pipelines run automatically without human interaction
- Disabling allow failure will make deployment pipelines mandatory for pipeline success
- If both autorun and allow failure are disabled, deployment pipelines will need manual triggering but will block the pipeline
Be aware: after a configuration change, the pipeline is not triggered automatically.
Please run it manually (by clicking the "Run pipeline" button in the Pipelines tab) or push new code.