RabbitMQ pod-0 doesn't join the cluster
Summary
Since RabbitMQ 4.1, Kubernetes peer discovery makes every node join pod-0, while pod-0 itself always starts directly without joining anyone [1]. If pod-0 starts fresh after its PVC was deleted (e.g. because the node was redeployed), it therefore forms a second cluster.
Should we implement the "old" discovery mechanism, i.e. check whether there are other pods and, if so, join pod-0 to the existing cluster?
Detailed Description
Both the blog entry [2] and an issue in the rabbitmq cluster-operator [3] state that this is more or less expected and that one should join the node manually (or not delete the data, but that goes against the yaook way of redeploying everything...).
Steps to reproduce the issue
- Delete the PVC of MQ pod-0
- Delete/Restart pod-0
- If needed, run on one of the remaining cluster members: rabbitmqctl forget_cluster_node rabbit@$node
- Wait until the node comes up
- Run rabbitmqctl cluster_status on pod-0 and on any other MQ pod
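The steps above could be scripted roughly as follows. This is only a sketch: the namespace, the StatefulSet name, the PVC name and the `KUBECTL` override are placeholders chosen for illustration, not values from a real yaook deployment, and the optional argument is the node name of the old pod-0 as it appears after `rabbit@`.

```shell
# Placeholders (assumptions): adjust to the actual deployment.
NAMESPACE="${NAMESPACE:-yaook}"
STS="${STS:-mq}"                  # hypothetical StatefulSet name
PVC="${PVC:-data-${STS}-0}"       # hypothetical PVC name of pod-0
KUBECTL="${KUBECTL:-kubectl}"

# Run rabbitmqctl inside a pod: mq_ctl <pod> <rabbitmqctl args...>
mq_ctl() {
    pod="$1"; shift
    $KUBECTL -n "$NAMESPACE" exec "$pod" -- rabbitmqctl "$@"
}

# reproduce [old-node-name]
reproduce() {
    # --wait=false: the PVC is only removed once pod-0 no longer uses it.
    $KUBECTL -n "$NAMESPACE" delete pvc "$PVC" --wait=false
    $KUBECTL -n "$NAMESPACE" delete pod "${STS}-0"

    # Optional: let a remaining member forget the old pod-0 node.
    if [ -n "${1:-}" ]; then
        mq_ctl "${STS}-1" forget_cluster_node "rabbit@$1"
    fi

    # Retry until the recreated pod-0 exists and reports Ready.
    until $KUBECTL -n "$NAMESPACE" wait --for=condition=Ready \
            "pod/${STS}-0" --timeout=60s; do
        sleep 5
    done

    # Compare the cluster view of pod-0 with that of another member.
    mq_ctl "${STS}-0" cluster_status
    mq_ctl "${STS}-1" cluster_status
}
```

Calling `reproduce` (optionally with the old node name) runs the whole sequence; if the issue is present, the two `cluster_status` outputs list different sets of running nodes.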
Result
pod-0 started its own cluster and did not join the existing one. Queues are not synced to pod-0. Restarting other MQ pods will probably fail, as their queues would then be out of quorum.
Expected Result
Pod-0 should join the cluster?
Additional Information
[1] https://www.rabbitmq.com/docs/cluster-formation#kubernetes-peer-discovery-overview
[3] https://github.com/rabbitmq/cluster-operator/issues/1957#issuecomment-3304012484
Resolution
- Adjust the MQ sts to check whether there are other pods and, if so, join the node to them.
- Don't join automatically but build tooling (e.g. in yaookctl) to join the node again.
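Either option ends up performing the same manual join on pod-0. A minimal sketch of what such tooling might run is shown here; the function names, the namespace default and the `KUBECTL` override are hypothetical, and the sketch assumes the stale pod-0 member has already been removed from the existing cluster with forget_cluster_node. How a pod name maps to its `rabbit@...` node name depends on the deployment and is left to the caller.

```shell
NAMESPACE="${NAMESPACE:-yaook}"   # placeholder namespace (assumption)
KUBECTL="${KUBECTL:-kubectl}"

# first_reachable_peer <pod>...
# Prints the first pod whose RabbitMQ node answers a ping.
first_reachable_peer() {
    for p in "$@"; do
        if $KUBECTL -n "$NAMESPACE" exec "$p" -- \
                rabbitmq-diagnostics -q ping >/dev/null 2>&1; then
            echo "$p"
            return 0
        fi
    done
    return 1
}

# rejoin_node <pod> <peer-node>
# Joins the RabbitMQ node in <pod> to the cluster that <peer-node>
# (e.g. rabbit@...) belongs to. "reset" wipes the local node state,
# which is only acceptable here because pod-0 was recreated empty.
rejoin_node() {
    pod="$1"
    peer="$2"
    for step in stop_app reset "join_cluster $peer" start_app; do
        # $step is intentionally unquoted so that "join_cluster <peer>"
        # is passed as two separate arguments.
        $KUBECTL -n "$NAMESPACE" exec "$pod" -- rabbitmqctl $step || return 1
    done
}
```

For the first option, the check would run automatically when pod-0 starts with an empty data directory; for the second, an operator would trigger it explicitly, which avoids resetting a node by accident.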
Proposal
To be discussed.
Specification
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this issue are to be interpreted in the spirit of RFC 2119, even though we're not technically doing protocol design.