[concept] Cold Standby
This is more of a thought experiment than a decided feature; there are arguments that weigh both for and against it. What follows is a hopefully sound algorithm for providing cold standby systems. What cold standby means here: a second 'version' of the system that is powered off but able to take over the original system's functionality in the case of an outage.
Pros and opportunities:
- This is as close to a hot migration as you can get.
- It provides a vastly improved availability for legacy systems.
- It boosts attractiveness for legacy users.
Cons and risks:
- Supports continued use of legacy applications (delays migration towards a proper cloud architecture).
- When done automagically rather than by an operator, it carries the risk of a Hot/Hot or Cold/Cold situation, causing undefined behavior.
- Will use twice the space.
- Will require CPU and network resources to perform the sync.
- I believe that an algorithm like Raft or Paxos to enforce strong consistency is important, given that a hot/hot situation will probably be the most devastating outcome.
A sync interval of 1s is taken as an example here.
Names:
- F (fifo, aka sniffle, to provide quorum/consensus)
- H1 (1st hypervisor)
- H2 (2nd hypervisor)
- * (marks the hot hypervisor)
H1 and H2 each run an FSM that 1) is directly connected to its peer and 2) has access to F for consensus.
Steps:
- The VM is created on H1*. If a VM has only one hypervisor assigned, that hypervisor is automatically declared hot.
- H2 is added as a standby.
- H2 connects to H1*.
- H1* enters the connected state.
- H1* syncs the last known common state with H2: none.
- H1* performs a zfs send/receive of snapshot S1 of the VM to H2.
- H1* syncs the last known common state with H2: S1.
- H1* sends an incremental snapshot S2, based on S1, to H2.
- ...
- H1* goes down.
- H2 loses connectivity to H1*.
- H2 performs a reconnection attempt and fails.
- H2 requests the quorum from F. Since F can't reach H1 either, it grants the quorum to H2* (F and H2* together have more say than H1).
- H1 comes online again and boots in cold mode.
- H1 requests a list of hypervisors and finds H2* active.
- H1 connects to H2* as standby (sync etc. happens just in the opposite direction).
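The walk-through above can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the intended implementation: `Arbiter` stands in for F as a single decision point (it is not Raft or Paxos), reachability is tracked by hand instead of by real network probes, and the names `Arbiter`, `Hypervisor`, `plan_sync` and the dataset `zones/vm` are hypothetical. Only the `zfs send` and `zfs send -i` argument forms correspond to real commands, and they are built here without being executed.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


class Arbiter:
    """Stand-in for F: allows at most one hypervisor to be hot at a time."""

    def __init__(self) -> None:
        self.hot: Optional[str] = None   # name of the current hot hypervisor
        self.reachable: Set[str] = set() # hypervisors F can currently reach

    def request_hot(self, name: str) -> bool:
        # Grant if nobody is hot, the requester already is hot,
        # or F itself cannot reach the current hot hypervisor.
        if self.hot is None or self.hot == name or self.hot not in self.reachable:
            self.hot = name
            return True
        return False


@dataclass
class Hypervisor:
    name: str
    arbiter: Arbiter
    state: str = "cold"  # one of "cold", "standby", "hot"

    def boot(self) -> str:
        # Always start cold; only the arbiter can promote this node.
        self.state = "cold"
        self.arbiter.reachable.add(self.name)
        if self.arbiter.request_hot(self.name):
            self.state = "hot"
        else:
            self.state = "standby"  # connect to the hot peer, receive syncs
        return self.state

    def crash(self) -> None:
        self.state = "cold"
        self.arbiter.reachable.discard(self.name)

    def peer_lost(self, reconnect_ok: bool) -> None:
        # A standby that lost its hot peer retries once, then asks F.
        if self.state == "standby" and not reconnect_ok:
            if self.arbiter.request_hot(self.name):
                self.state = "hot"


def plan_sync(last_common: Optional[str], newest: str,
              dataset: str = "zones/vm") -> List[str]:
    """Build the zfs send argv: full stream without a common snapshot,
    incremental stream otherwise. The dataset name is a placeholder."""
    if last_common is None:
        return ["zfs", "send", f"{dataset}@{newest}"]
    return ["zfs", "send", "-i", f"{dataset}@{last_common}", f"{dataset}@{newest}"]
```

In this sketch, H2 is promoted only when both its own reconnect fails and F cannot reach H1. If just the direct link between H1 and H2 breaks, the request is denied and H2 stays standby, which is the hot/hot case the consensus step is meant to prevent. A recovering H1 boots cold and, finding H2 hot, becomes the standby.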
kevin: Seems like the wrong layer to gain redundancy. There are a lot of things that can go wrong in a system like this.