RabbitMQ nodes don't join the cluster and start as standalone nodes after redeploying nodes
Summary
After redeploying nodes, we observed that the pods in the stateful set are reported as "healthy", but each one is really running alone in its own cluster.
Detailed Description
In such a case we see in the logs:
[error] <0.276.0> Peer discovery: could not discover and join another node; proceeding as a standalone node
However, despite the error, the pod is reported as running and healthy.
Others have also found this behavior problematic, but no technical solution is in progress as of now: https://github.com/rabbitmq/rabbitmq-server/discussions/13257
Steps to reproduce the issue
- Redeploy a node that hosts mq pods, or delete an mq pod's PVC
- Run rabbitmqctl cluster_status and see that the node is alone
Result
See above: the node starts as a standalone node.
Expected Result
The node should have joined its peers.
Additional Information
Resolution
Proposal
We propose to check the output of rabbitmq-diagnostics discover_peers and compare it with the number of disk nodes in the output of rabbitmqctl cluster_status.
If rabbitmq-diagnostics discover_peers shows more than one peer, rabbitmqctl cluster_status should also return more than one disk node.
We propose to add this check to the startup probe on single-node clusters.
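The comparison proposed above could be sketched roughly as follows. This is only an illustration of the logic, not a finished probe: the function names are hypothetical, and the parsing rests on two assumptions that should be verified against the RabbitMQ version in use, namely that rabbitmq-diagnostics discover_peers prints discovered nodes as name@host tokens and that rabbitmqctl cluster_status --formatter json returns a top-level "disk_nodes" list.

```python
import json
import re
import subprocess

# Matches Erlang-style node names such as rabbit@mq-0.mq-headless.
NODE_NAME = re.compile(r"[\w.-]+@[\w.-]+")


def parse_discovered_peers(text: str) -> set[str]:
    """Collect node names from `rabbitmq-diagnostics discover_peers` output.

    Assumption: discovered nodes are printed as name@host tokens.
    """
    return set(NODE_NAME.findall(text))


def parse_disk_nodes(cluster_status_json: str) -> set[str]:
    """Read disk nodes from `rabbitmqctl cluster_status --formatter json`.

    Assumption: the JSON document has a top-level "disk_nodes" list.
    """
    return set(json.loads(cluster_status_json).get("disk_nodes", []))


def joined_cluster(peers: set[str], disk_nodes: set[str]) -> bool:
    """Fail only when more than one peer is discoverable but the node is alone."""
    if len(peers) > 1:
        return len(disk_nodes) > 1
    return True


def run(*cmd: str) -> str:
    """Run a CLI command and return its standard output."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def probe_exit_code() -> int:
    """Return 0 if the node looks clustered, 1 if it appears to be standalone."""
    peers = parse_discovered_peers(
        run("rabbitmq-diagnostics", "discover_peers", "--quiet")
    )
    disk_nodes = parse_disk_nodes(
        run("rabbitmqctl", "cluster_status", "--formatter", "json", "--quiet")
    )
    return 0 if joined_cluster(peers, disk_nodes) else 1
```

A probe script would exit with the value of probe_exit_code(), so that a node which can discover several peers but lists only itself as a disk node fails the startup probe instead of being reported healthy. The same comparison could equally be written as a small shell script if Python is not available in the RabbitMQ image.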
Specification
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this issue are to be interpreted in the spirit of RFC 2119, even though we're not technically doing protocol design.