Handle the new Consul behaviour introduced by the "server_rejoin_age_max" setting

Consul 1.15 added a new setting, server_rejoin_age_max, which brings the following new behaviour:

Controls the allowed maximum age of a stale server attempting to rejoin a cluster. If the server has not ran during this period, it will refuse to start up again until an operator intervenes by manually deleting the server_metadata.json file located in the data dir. This is to protect clusters from instability caused by decommissioned servers accidentally being started again. Note: the default value is 168h (equal to 7d) and the minimum value is 6h.
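To make the check concrete, here is a minimal Python sketch of the staleness logic described above. It is not Consul's implementation. It assumes server_metadata.json is a JSON object with a `last_seen_unix` field (seconds since the epoch), which should be verified against the Consul version in use. The `is_stale` helper and its parameters are hypothetical names for illustration.

```python
import json
import time
from datetime import timedelta
from pathlib import Path

# Default quoted in the docs above: 168h (7 days).
DEFAULT_REJOIN_AGE_MAX = timedelta(hours=168)


def is_stale(metadata_path, max_age=DEFAULT_REJOIN_AGE_MAX, now=None):
    """Return True if the server was last seen longer ago than max_age.

    Assumption: the metadata file holds {"last_seen_unix": <epoch seconds>}.
    """
    path = Path(metadata_path)
    if not path.exists():
        # No metadata file (for example after an operator deleted it):
        # there is nothing to compare against, so treat it as not stale.
        return False
    last_seen = json.loads(path.read_text())["last_seen_unix"]
    current = time.time() if now is None else now
    return (current - last_seen) > max_age.total_seconds()
```

A helper like this could be run against each server's data dir before starting an environment that has been off for a while, to see which nodes would need manual intervention.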

The consequence of this behaviour is that it applies to all server nodes. This means that if an environment is turned off for as little as a week, Consul now refuses to start outright, and the only fix is to go into each box manually and delete the server_metadata.json file from the data directory, which is a costly admin experience. This has prompted some discussion, and an active issue is currently open.

Given the heavy cost when this setting is triggered, it may be desirable on our end to disable it, as arguably the protection it offers doesn't justify that cost in our case.

Edited Jan 16, 2024 by Grant Young