[concept] Cold Standby
https://gitlab.com/Project-FiFo/FiFo/help/-/issues/38
Heinz N. Gies, 2017-09-28T14:02:55Z

This is more of a thought experiment than a decided feature; there are things that weigh for and against it. However, the following is a hopefully sound algorithm to provide cold standby systems.
What does cold standby mean here: a second 'version' of the system that is powered off but able to take over the original system's functionality in the case of an outage.
Pros and opportunities:
* This is as close to a hot migration as you can get.
* It provides a vastly improved availability for legacy systems.
* It boosts attractiveness for legacy users.
Cons and risks:
* Supports use of legacy applications (delays migration towards a proper cloud architecture)
* When done automagically rather than by an operator, it risks either a Hot/Hot or a Cold/Cold situation, causing undefined behavior.
* Will use twice the space.
* Will require CPU and network resources to perform the sync.
* I believe that a consensus algorithm (like Raft or Paxos) to enforce strong consistency is important, given that a hot/hot situation will probably be the most devastating outcome.
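
To illustrate the hot/hot risk named above, here is a minimal sketch of a tie-breaker that hands out a single "hot" grant. The `Arbiter` class, its `request_hot` method and the names `hv-a`/`hv-b` are all hypothetical and not part of FiFo. A single in-memory arbiter is also not Raft or Paxos; it only shows the rule that at most one side may hold the grant at a time.

```python
class Arbiter:
    """Hypothetical tie-breaker that hands out a single 'hot' grant."""

    def __init__(self):
        # Name of the hypervisor currently allowed to run hot, if any.
        self.holder = None

    def request_hot(self, candidate, holder_reachable):
        """Grant the hot role to `candidate` only if nobody holds it,
        the candidate already holds it, or the current holder is unreachable."""
        if self.holder in (None, candidate) or not holder_reachable:
            self.holder = candidate
            return True
        return False


arbiter = Arbiter()
print(arbiter.request_hot("hv-a", holder_reachable=True))   # True: nobody was hot yet
print(arbiter.request_hot("hv-b", holder_reachable=True))   # False: hv-a is hot and reachable
print(arbiter.request_hot("hv-b", holder_reachable=False))  # True: hv-a is gone, hv-b takes over
print(arbiter.request_hot("hv-a", holder_reachable=True))   # False: hv-a must stay cold
```

In a real deployment the grant would presumably live in a consensus-backed store, and an unreachable holder may still be running, which is exactly the hot/hot case this rule alone cannot rule out.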
## Algorithm (draft):
```
a sync interval of 1s is taken as an example here.
Names:
F (fifo aka sniffle to provide quorum/consensus)
H1 (1st hypervisor)
H2 (2nd hypervisor)
* (marks hot hypervisor)
H1 and H2 each run an FSM that
1) is directly connected to the other
2) has access to F for consensus.
The VM is created on H1*; if a VM has only one hypervisor assigned, that hypervisor is automatically declared hot.
H2 is added as a standby.
H2 connects to H1*.
H1* enters connected state.
H1* syncs the last known common state with H2: none
H1* performs a zfs send/receive of snapshot S1 of the VM to H2.
H1* syncs the last known common state with H2: S1
H1* sends an incremental snapshot S2 (on top of S1) to H2.
.
.
.
H1* goes down.
H2 loses connectivity to H1*.
H2 performs a reconnection attempt and fails.
H2 requests the quorum from F; since F can't reach H1 either, it grants the quorum to H2* (F + H2* have more say than H1).
H1 comes online again; it boots in cold mode.
H1 requests a list of hypervisors and finds H2* active.
H1 connects to H2* as standby.
(sync etc happens just in the opposite direction.)
```
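
The sync steps in the draft (a full send when the last known common state is "none", an incremental send afterwards) can be sketched as a small command planner. This is only an illustration under assumptions: the function name `plan_sync`, the dataset name `zones/vm-uuid`, the host name `h2` and the use of `ssh` as transport are hypothetical, and the standby is assumed to use the same dataset name. The commands are built but not executed.

```python
def plan_sync(dataset, last_common, new_snapshot, standby_host):
    """Build the (send, receive) argument lists for one sync step.

    last_common is the snapshot name both sides already have, or None
    if the standby has never received anything.
    """
    target = f"{dataset}@{new_snapshot}"
    if last_common is None:
        # No common state yet: send the full stream of the snapshot.
        send = ["zfs", "send", target]
    else:
        # Common state exists: send only the delta since that snapshot.
        send = ["zfs", "send", "-i", f"{dataset}@{last_common}", target]
    # Assumption: the stream is piped over ssh into `zfs receive` on the standby.
    receive = ["ssh", standby_host, "zfs", "receive", dataset]
    return send, receive


# First sync: last known common state is "none".
send, recv = plan_sync("zones/vm-uuid", None, "S1", "h2")
print(" ".join(send), "|", " ".join(recv))

# Next interval: S1 is the common state, S2 is the new snapshot.
send, recv = plan_sync("zones/vm-uuid", "S1", "S2", "h2")
print(" ".join(send), "|", " ".join(recv))
```

After a failover the same planner would apply with the roles swapped, since the sync then runs from H2* to H1.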
kevin:
Seems like the wrong layer to gain redundancy. There are a lot of things that can go wrong in a system like this.