etcd: avoid "panic: etcdserver: mvcc: database space exceeded"
We observed a platform using RKE2 where etcd stopped working with the following error:
Jan 02 08:58:48 first-workload-cluster-control-plane-zwqd5 rke2[3017239]: {"level":"warn","ts":"2024-01-02T08:58:48.916Z","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001636000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = etcdserver: mvcc: database space exceeded"}
Jan 02 08:58:48 first-workload-cluster-control-plane-zwqd5 rke2[3017239]: panic: etcdserver: mvcc: database space exceeded
This is well-documented etcd behavior: etcd enforces a storage quota on its backend database, with a conservative default of 2 GB. Once the quota is exceeded, etcd raises a NOSPACE alarm and rejects further writes.
Fixing a running platform requires applying compact and defrag operations:
- https://ranchermanager.docs.rancher.com/troubleshooting/kubernetes-components/troubleshooting-etcd-nodes#compact-the-keyspace
- https://ranchermanager.docs.rancher.com/troubleshooting/kubernetes-components/troubleshooting-etcd-nodes#defrag-all-etcd-members
- the subtlety is that running those commands is not trivial: we can't use `crictl exec` reliably because the rke2-server systemd service is constantly restarting, so we have to use `nsenter -t $(pgrep etcd) -a sh` instead
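
As a rough sketch, the compact and defrag steps from the pages linked above could be run from a shell that has access to etcdctl (for example the nsenter shell). The certificate paths below are assumed to be the usual RKE2 locations and should be checked on the node. The helper names (`ectl`, `latest_revision`, `compact_and_defrag`) are made up for this example, and it assumes `grep` and `head` are available in that shell.

```shell
#!/bin/sh
# Assumption: default RKE2 etcd client certificate directory; adjust if needed.
TLS_DIR="${TLS_DIR:-/var/lib/rancher/rke2/server/tls/etcd}"
ETCDCTL="${ETCDCTL:-etcdctl}"

# Hypothetical wrapper: etcdctl with the local endpoint and client certs.
ectl() {
  "$ETCDCTL" --endpoints=https://127.0.0.1:2379 \
    --cacert "$TLS_DIR/server-ca.crt" \
    --cert "$TLS_DIR/server-client.crt" \
    --key "$TLS_DIR/server-client.key" "$@"
}

# Read `endpoint status --write-out json` on stdin, print the first revision.
latest_revision() {
  grep -Eo '"revision":[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

# Compact up to the current revision, defragment all members,
# then clear the NOSPACE alarm so writes are accepted again.
compact_and_defrag() {
  rev=$(ectl endpoint status --write-out json | latest_revision) || return 1
  [ -n "$rev" ] || return 1
  ectl compact "$rev" &&
    ectl defrag --cluster &&
    ectl alarm disarm
}

# Usage, once the paths above have been verified:
#   compact_and_defrag
```

The alarm has to be disarmed explicitly: freeing space with compact and defrag does not by itself clear the alarm that blocks writes.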
Then the default etcd settings need to be tuned to prevent the issue from recurring.
We need to fix the default etcd settings in Sylva to enable automated compaction and allow a larger DB size
(the quota-backend-bytes, auto-compaction-retention and auto-compaction-mode etcd settings).
I'm not sure whether this is specific to RKE2 or whether kubeadm already has better settings.
We need to check both, and at least fix RKE2.
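
For the RKE2 side, one possible shape for the fix is to pass the three settings through the `etcd-arg` option of the RKE2 server configuration. The sketch below only prints such a fragment. The helper name `etcd_args_yaml` is hypothetical, and the values (8 GiB quota, periodic compaction with 1h retention) are illustrative assumptions rather than agreed defaults. How Sylva should inject this fragment is still to be decided.

```shell
#!/bin/sh
# Hypothetical helper: print an RKE2 config fragment that forwards
# the three etcd flags discussed above through `etcd-arg`.
# $1 = backend quota in bytes, $2 = auto-compaction retention (e.g. 1h)
etcd_args_yaml() {
  quota_bytes=$1
  retention=$2
  cat <<EOF
etcd-arg:
  - "quota-backend-bytes=${quota_bytes}"
  - "auto-compaction-mode=periodic"
  - "auto-compaction-retention=${retention}"
EOF
}

# Illustrative values only: 8 GiB quota, keep 1 hour of history.
etcd_args_yaml $((8 * 1024 * 1024 * 1024)) 1h
```

The printed fragment would typically be merged into /etc/rancher/rke2/config.yaml on the server nodes, followed by a restart of rke2-server.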