RabbitMQ nodes don't join the cluster and start as standalone nodes after nodes are redeployed

Summary

After redeploying nodes, we observed that the pods in the stateful set are reported as "healthy", but each affected pod has actually formed its own single-node cluster instead of joining its peers.

Detailed Description

In such a case we see in the logs:

[error] <0.276.0> Peer discovery: could not discover and join another node; proceeding as a standalone node

Despite this error, the pod is reported as running and healthy.

Others have also found this behavior problematic, but no upstream fix is in progress as of now: https://github.com/rabbitmq/rabbitmq-server/discussions/13257

Steps to reproduce the issue

  1. Redeploy a node that runs mq pods, or delete the PVC of an mq pod
  2. Run rabbitmqctl cluster_status on the affected pod and observe that it lists only itself

Result

The node starts as a standalone node and the pod is still reported as healthy (see above).

Expected Result

The node should have joined its peers.

Additional Information

Resolution

Proposal

We propose comparing the output of rabbitmq-diagnostics discover_peers with the number of disk nodes in the output of rabbitmqctl cluster_status. If rabbitmq-diagnostics discover_peers shows more than one peer, rabbitmqctl cluster_status should also return more than one disk node; otherwise the node has started standalone even though peers were discoverable. We propose to add this check to the startup probe on single-node clusters.
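
The decision rule of the proposed check could look roughly like the following sketch. It only covers the comparison itself: the function names (joined_expected_cluster, probe_exit_code) are hypothetical, and the two input lists are assumed to have been extracted beforehand from the output of rabbitmq-diagnostics discover_peers and rabbitmqctl cluster_status. How that output is parsed depends on the RabbitMQ version and is intentionally left out here.

```python
from typing import Sequence


def joined_expected_cluster(discovered_peers: Sequence[str],
                            disk_nodes: Sequence[str]) -> bool:
    """Return True if the node's cluster membership looks plausible.

    discovered_peers: node names assumed to come from
        `rabbitmq-diagnostics discover_peers` (parsing not shown).
    disk_nodes: node names assumed to come from the disk nodes section of
        `rabbitmqctl cluster_status` (parsing not shown).
    """
    # Duplicates would distort the counts, so compare distinct names only.
    peer_count = len(set(discovered_peers))
    disk_node_count = len(set(disk_nodes))

    # If peer discovery sees more than one peer, the cluster must also
    # contain more than one disk node. Otherwise this node is standalone.
    if peer_count > 1:
        return disk_node_count > 1

    # Zero or one discovered peer: a single-node cluster is acceptable.
    return True


def probe_exit_code(discovered_peers: Sequence[str],
                    disk_nodes: Sequence[str]) -> int:
    """Map the check to an exit code: 0 passes the probe, 1 fails it."""
    return 0 if joined_expected_cluster(discovered_peers, disk_nodes) else 1
```

With this rule, a pod that discovers three peers but reports only itself as a disk node fails the probe, while a deployment in which only one peer is discoverable still passes.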

Specification

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this issue are to be interpreted in the spirit of RFC 2119, even though we're not technically doing protocol design.
