CTDB: do takeover run (failover) during shutdown
This is effectively a new implementation of !748 (closed). That MR was closed because its implementation relied on ctdbd_wrapper, which has since been removed.
During ctdbd shutdown, a takeover run is triggered and waited for. The time to wait can be controlled by a new configuration option:
[failover]
shutdown failover timeout = 5
This is enabled by default with a timeout of 10 (seconds). Setting it to 0 disables failover during shutdown.
Doing this failover allows NFS servers on other nodes to be put into grace (in the startipreallocate event). Without it, clients connected to other nodes could "steal" locks in the window between the stopping node shutting down its NFS server (and releasing its locks) and the other nodes noticing that the node is gone, recovering, and doing the subsequent takeover run.
An additional configuration option causes an extra delay after the takeover run completes:
[failover]
shutdown extra timeout = 10
This is disabled by default (set to 0). The documentation provides a couple more details. This extra timeout is similar to CTDB_DISABLE_BEFORE_SHUTDOWN_SLEEPTIME in !748 (closed).
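For reference, the two options could be written out together with the default values described above. This is only a sketch: it assumes both options may appear in a single [failover] section and that lines starting with # are treated as comments; the documentation added by this MR remains the authority on the exact syntax.

```ini
[failover]
# Wait up to 10 seconds for the takeover run during shutdown
# (the stated default); 0 disables failover during shutdown.
shutdown failover timeout = 10
# Extra delay after the takeover run completes
# (the stated default of 0 disables it).
shutdown extra timeout = 0
```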
According to !748 (closed) this will:
allow SMB2 Durable Handles reconnect to work for the case when a node is administratively shut down
While testing this, the bug worked around in the first commit (abort during early shutdown) was found. That commit reliably fixes the problem: I've run the test mentioned in the commit message 60 times and have not seen the bug occur again.
This MR fixes: https://bugzilla.samba.org/show_bug.cgi?id=15858
Checklist
- Commits have Signed-off-by: with name/author being identical to the commit author
- (optional) This MR is just one part towards a larger feature.
- (optional, if backport required) Bugzilla bug filed and BUG: tag added
- Test suite updated with functionality tests
- Test suite updated with negative tests
- Documentation updated
- CI timeout is 3h or higher (see Settings/CICD/General pipelines/Timeout)
Reviewer's checklist:
- There is a test suite reasonably covering new functionality or modifications
- Function naming, parameters, return values, types, etc., are consistent and according to README.Coding.md
- This feature/change has adequate documentation added
- No obvious mistakes in the code