Raft: Implement Raft replica bootstrapping
This issue tracks the work of bootstrapping replication after the cluster itself is bootstrapped (#6032 (closed)). As per the original design, replica data placement is stable. We follow a ring-like replica scheme.
Assuming we have 5 storages: a, b, c, d, e.
- storage-a is replicated to storage-b, storage-c
- storage-b is replicated to storage-c, storage-d
- storage-c is replicated to storage-d, storage-e
- storage-d is replicated to storage-e, storage-a
- storage-e is replicated to storage-a, storage-b
In the case of 3 storages, they form a complete graph.
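The ring scheme above can be sketched as a small placement helper. This is an illustrative model only (the function name `replicasFor` and the hard-coded storage names are assumptions, not part of the implementation); it just formalizes "each storage is replicated to the next N-1 storages in the ring":

```go
package main

import "fmt"

// replicasFor returns the storages that replicate the given storage's
// repositories under the ring scheme: each storage is replicated to the
// next replicationFactor-1 storages in the ordered ring.
func replicasFor(storage string, ring []string, replicationFactor int) []string {
	for i, s := range ring {
		if s != storage {
			continue
		}
		replicas := make([]string, 0, replicationFactor-1)
		for j := 1; j < replicationFactor; j++ {
			replicas = append(replicas, ring[(i+j)%len(ring)])
		}
		return replicas
	}
	return nil
}

func main() {
	ring := []string{"storage-a", "storage-b", "storage-c", "storage-d", "storage-e"}
	for _, s := range ring {
		// First line: storage-a is replicated to [storage-b storage-c]
		fmt.Printf("%s is replicated to %v\n", s, replicasFor(s, ring, 3))
	}
}
```

With only 3 storages, every storage replicates to the other two, which is the complete-graph case mentioned above.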
During implementation, we structure each partition to be equivalent to one Raft group. This unavoidable decision leads to two interesting consequences:
- We cannot enforce the strict replication order (a -> b, c). After creation, a repository's leader role might float freely between a, b, and c. Although Raft tries its best to stabilize the leadership role, there is no guarantee about its location.
- A storage must start a Raft group to replicate its changes. There must be a way to signal other nodes to start a corresponding Raft group. In addition, the list of repositories must be persisted.
The first consequence is not necessarily a bad thing. On the contrary, it increases the resilience of the cluster. Let's examine an example. Each storage holds the leader role for 2 repositories and the replica role for another 4 repositories (from 2 other nodes).
|  | storage-a | storage-b | storage-c | storage-d | storage-e |
| --- | --- | --- | --- | --- | --- |
| Leader of | 1, 2 | 3, 4 | 5, 6 | 7, 8 | 9, 10 |
| Replica of | 7, 8, 9, 10 | 9, 10, 1, 2 | 1, 2, 3, 4 | 3, 4, 5, 6 | 5, 6, 7, 8 |
When storage-b goes down, the repositories for which storage-b held the leader role are evicted to storage-c and storage-d equally. Each repository's quorum is still maintained.
|  | storage-a | storage-b (down) | storage-c | storage-d | storage-e |
| --- | --- | --- | --- | --- | --- |
| Leader of | 1, 2 |  | 4, 5, 6 | 3, 7, 8 | 9, 10 |
| Replica of | 7, 8, 9, 10 |  | 1, 2, 3 | 4, 5, 6 | 5, 6, 7, 8 |
When storage-b comes back, the leaders of repositories might shuffle, but in general, the number of leaders and replicas won't change.
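The redistribution shown in the table can be modeled as spreading the failed storage's leader roles evenly across its replicas. Note this is only a model of the expected end state: in practice each repository's Raft group elects its own new leader independently, so the exact assignment (e.g. whether repository 3 lands on storage-c or storage-d) is nondeterministic. The helper name `failover` is an assumption for illustration:

```go
package main

import "fmt"

// failover spreads the repositories led by a failed storage round-robin
// across its replica storages, modeling the "evicted equally" behaviour.
// Real leader placement is decided by per-group Raft elections.
func failover(ledRepos []int, replicas []string) map[string][]int {
	out := make(map[string][]int)
	for i, repo := range ledRepos {
		target := replicas[i%len(replicas)]
		out[target] = append(out[target], repo)
	}
	return out
}

func main() {
	// storage-b led repositories 3 and 4; its replicas are storage-c and storage-d,
	// so each of them picks up exactly one new leader role.
	fmt.Println(failover([]int{3, 4}, []string{"storage-c", "storage-d"}))
}
```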
The second one could be solved by special "replica Raft groups".
- When storage-a starts, it starts a "replica Raft group" in which it's the only voting member and storage-b and storage-c are non-voting members. It's also a non-voting member of the replica Raft groups hosted by storage-d and storage-e.
- When a repository (repository-1 for example) is created in storage-a, storage-a starts a repository Raft group in which storage-a is the initial member. It then writes into the replica Raft group. storage-b and storage-c pick up the new log, persist it, and start corresponding repository Raft groups. repository-1's Raft group has three voting members afterward: storage-a, storage-b, storage-c.
- If storage-a goes down, its replica Raft group stops functioning. That's fine because storage-a isn't capable of creating a new repository anyway. The quorum of repository-1's Raft group is maintained. Hence, it starts a new election term and moves its leadership to either storage-b or storage-c.
- When storage-a comes back, it starts all repository Raft groups for the repositories recorded in the replica Raft groups of which it's a member. These are the repositories that storage-a needs to maintain. If storage-d or storage-e created new repositories while it was away, storage-a becomes aware of them after it fetches the newest log entries.
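The creation step in the list above can be modeled as follows. This is a toy model, not the implementation: the `storage` type and `createRepository` helper are assumptions, and the real replica Raft groups would run on dragonboat with the non-voting members applying the replicated log entry rather than being called directly:

```go
package main

import "fmt"

// storage is a toy model of a node: it tracks which repository Raft groups
// it participates in.
type storage struct {
	name       string
	repoGroups map[string]bool
}

func newStorage(name string) *storage {
	return &storage{name: name, repoGroups: make(map[string]bool)}
}

// createRepository models appending a "create repository" entry to the
// owner's replica Raft group log: the owner and every non-voting replica
// apply the entry by starting the corresponding repository Raft group.
func createRepository(repo string, owner *storage, replicas []*storage) {
	for _, s := range append([]*storage{owner}, replicas...) {
		s.repoGroups[repo] = true // apply the replicated log entry
	}
}

func main() {
	a, b, c := newStorage("storage-a"), newStorage("storage-b"), newStorage("storage-c")
	createRepository("repository-1", a, []*storage{b, c})
	// prints: true true true — repository-1's group has three members afterward
	fmt.Println(a.repoGroups["repository-1"], b.repoGroups["repository-1"], c.repoGroups["repository-1"])
}
```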
As in #6104, we reason about keeping relative paths as the SSOT. However, as dragonboat doesn't allow using a string as the ShardID, we need to maintain a hashmap between {relative_path: shard_id}.
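A minimal sketch of that mapping, assuming a monotonically increasing uint64 allocator (dragonboat identifies shards by uint64). This is in-memory only for illustration; the real mapping must be persisted so that shard IDs stay stable across restarts. The type name, methods, and example paths are all hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// shardRegistry maintains the {relative_path: shard_id} mapping: each
// repository's relative path is assigned a uint64 ShardID on first use.
type shardRegistry struct {
	mu     sync.Mutex
	byPath map[string]uint64
	nextID uint64
}

func newShardRegistry() *shardRegistry {
	return &shardRegistry{byPath: make(map[string]uint64), nextID: 1}
}

// shardID returns the ShardID for a relative path, allocating a new one
// if the path hasn't been seen before.
func (r *shardRegistry) shardID(relativePath string) uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	if id, ok := r.byPath[relativePath]; ok {
		return id
	}
	id := r.nextID
	r.nextID++
	r.byPath[relativePath] = id
	return id
}

func main() {
	reg := newShardRegistry()
	fmt.Println(reg.shardID("@hashed/aa/bb/repo-1.git")) // 1
	fmt.Println(reg.shardID("@hashed/cc/dd/repo-2.git")) // 2
	fmt.Println(reg.shardID("@hashed/aa/bb/repo-1.git")) // 1 (stable on lookup)
}
```

A counter-based allocation is used here rather than hashing the path, since hashing a string down to a uint64 would admit collisions between distinct repositories.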