
etcd: avoid "panic: etcdserver: mvcc: database space exceeded"

We observed a platform using RKE2 where etcd stopped working with the following errors:

Jan 02 08:58:48 first-workload-cluster-control-plane-zwqd5 rke2[3017239]: {"level":"warn","ts":"2024-01-02T08:58:48.916Z","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001636000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = etcdserver: mvcc: database space exceeded"}
Jan 02 08:58:48 first-workload-cluster-control-plane-zwqd5 rke2[3017239]: panic: etcdserver: mvcc: database space exceeded

This is well-documented etcd behavior: etcd enforces a storage quota on its backend database, with a conservative default of 2GB. Once the quota is exceeded, etcd raises a NOSPACE alarm and rejects further writes.

Fixing a running platform requires compacting the etcd revision history, defragmenting the database, and then disarming the alarm.
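For reference, here is a minimal sketch of that recovery sequence, following the steps described in the etcd maintenance documentation. It assumes `etcdctl` (v3 API) is reachable on `PATH` and that the connection settings (for example `ETCDCTL_ENDPOINTS`, `ETCDCTL_CACERT`, `ETCDCTL_CERT`, `ETCDCTL_KEY`) are already exported for a single endpoint. The function names `current_revision` and `recover_etcd_space` are made up for this example. On RKE2, etcd runs as a static pod, so the commands may have to be run from inside that pod or with a separately installed `etcdctl`.

```shell
#!/bin/sh
# Sketch only: assumes etcdctl v3 on PATH and ETCDCTL_* connection
# variables already set for ONE endpoint (hypothetical setup).

# Print the current store revision, parsed from `endpoint status` JSON.
current_revision() {
  etcdctl endpoint status --write-out=json \
    | grep -E -o '"revision":[0-9]+' \
    | grep -E -o '[0-9]+'
}

# Compact old revisions, defragment the backend, then clear the alarm.
recover_etcd_space() {
  rev=$(current_revision) || return 1
  [ -n "$rev" ] || return 1
  etcdctl compact "$rev" &&
    etcdctl defrag &&
    etcdctl alarm disarm
}
```

Calling `recover_etcd_space` on a control-plane node runs the three steps in order and stops at the first failure. Note that `defrag` only acts on the endpoints it is pointed at, so in a multi-member cluster it would need to be repeated per member.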

Then the default etcd settings need to be tuned to prevent the issue from recurring.

We need to fix default etcd settings in Sylva to have automated compaction and a larger DB size (quota-backend-bytes, auto-compaction-retention and auto-compaction-mode etcd settings).
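As an illustration of what such settings could look like on the RKE2 side, the sketch below prints a config fragment using the RKE2 `etcd-arg` option. The values are assumptions for discussion, not a decided default: 8 GiB is the maximum quota suggested by the etcd documentation, and a periodic compaction with a 1h retention is only an example. The helper name `rke2_etcd_args` is hypothetical, and where this fragment would be injected in Sylva is not shown here.

```shell
#!/bin/sh
# Sketch only: prints an RKE2 config fragment with illustrative values.
# $1 = quota in GiB (default 8), $2 = compaction retention (default 1h).
rke2_etcd_args() {
  quota_gib=${1:-8}
  retention=${2:-1h}
  # quota-backend-bytes expects a byte count
  quota_bytes=$((quota_gib * 1024 * 1024 * 1024))
  cat <<EOF
etcd-arg:
  - "quota-backend-bytes=${quota_bytes}"
  - "auto-compaction-mode=periodic"
  - "auto-compaction-retention=${retention}"
EOF
}

rke2_etcd_args 8 1h
```

The output is meant to be merged into the RKE2 server configuration (typically `/etc/rancher/rke2/config.yaml`) on control-plane nodes, followed by a restart of the RKE2 server service.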

I'm not sure whether this is specific to RKE2 or whether kubeadm already has better settings.

We need to check both, and at least fix RKE2.