Raft: Implement Raft replica bootstrapping
This issue tracks the work of bootstrapping replication after the cluster itself is bootstrapped (#6032 (closed)). As per the original design, replica data placement is stable. We follow a ring-like replica scheme.
Assuming we have 5 storages: a, b, c, d, e.
- storage-a is replicated to storage-b, storage-c
- storage-b is replicated to storage-c, storage-d
- storage-c is replicated to storage-d, storage-e
- storage-d is replicated to storage-e, storage-a
- storage-e is replicated to storage-a, storage-b
In the case of 3 storages, they form a complete graph.
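The ring scheme above can be sketched as a small placement helper. This is an illustrative model only (the function name `replicasFor` and the hard-coded storage names are assumptions, not part of the implementation); it just formalizes "each storage is replicated to the next N-1 storages in the ring":

```go
package main

import "fmt"

// replicasFor returns the storages that replicate the given storage's
// repositories under the ring scheme: each storage is replicated to the
// next replicationFactor-1 storages in the ordered ring.
func replicasFor(storage string, ring []string, replicationFactor int) []string {
	for i, s := range ring {
		if s != storage {
			continue
		}
		replicas := make([]string, 0, replicationFactor-1)
		for j := 1; j < replicationFactor; j++ {
			replicas = append(replicas, ring[(i+j)%len(ring)])
		}
		return replicas
	}
	return nil
}

func main() {
	ring := []string{"storage-a", "storage-b", "storage-c", "storage-d", "storage-e"}
	for _, s := range ring {
		// First line: storage-a is replicated to [storage-b storage-c]
		fmt.Printf("%s is replicated to %v\n", s, replicasFor(s, ring, 3))
	}
}
```

With only 3 storages, every storage replicates to the other two, which is the complete-graph case mentioned above.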
During implementation, we structure each partition to be equivalent to one Raft group. This unavoidable decision leads to two interesting consequences:
- We cannot enforce the strict replication order (a -> b, c). After creation, a repository's leader role might float freely between a, b, and c. Although Raft tries its best to stabilize the leadership role, there is no guarantee about its location.
- A storage must start a Raft group to replicate its changes. There must be a way to signal other nodes to start a corresponding Raft group. In addition, the list of repositories must be persisted.
The first consequence is not necessarily a bad thing. On the contrary, it increases the resilience of the cluster. Let's examine an example. Each storage holds the leader role for 2 repositories and the replica role for another 4 repositories (from 2 other nodes).
|  | storage-a | storage-b | storage-c | storage-d | storage-e |
| --- | --- | --- | --- | --- | --- |
| Leader of | 1, 2 | 3, 4 | 5, 6 | 7, 8 | 9, 10 |
| Replica of | 7, 8, 9, 10 | 9, 10, 1, 2 | 1, 2, 3, 4 | 3, 4, 5, 6 | 5, 6, 7, 8 |
When storage-b goes down, the repositories for which storage-b held the leader role are evicted to storage-c and storage-d equally. Each repository's quorum is still maintained.
|  | storage-a | storage-b (down) | storage-c | storage-d | storage-e |
| --- | --- | --- | --- | --- | --- |
| Leader of | 1, 2 |  | 4, 5, 6 | 3, 7, 8 | 9, 10 |
| Replica of | 7, 8, 9, 10 |  | 1, 2, 3 | 4, 5, 6 | 5, 6, 7, 8 |
When storage-b comes back, the leaders of repositories might shuffle, but in general, the number of leaders and replicas won't change.
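The redistribution shown in the table can be modeled as spreading the failed storage's leader roles evenly across its replicas. Note this is only a model of the expected end state: in practice each repository's Raft group elects its own new leader independently, so the exact assignment (e.g. whether repository 3 lands on storage-c or storage-d) is nondeterministic. The helper name `failover` is an assumption for illustration:

```go
package main

import "fmt"

// failover spreads the repositories led by a failed storage round-robin
// across its replica storages, modeling the "evicted equally" behaviour.
// Real leader placement is decided by per-group Raft elections.
func failover(ledRepos []int, replicas []string) map[string][]int {
	out := make(map[string][]int)
	for i, repo := range ledRepos {
		target := replicas[i%len(replicas)]
		out[target] = append(out[target], repo)
	}
	return out
}

func main() {
	// storage-b led repositories 3 and 4; its replicas are storage-c and storage-d,
	// so each of them picks up exactly one new leader role.
	fmt.Println(failover([]int{3, 4}, []string{"storage-c", "storage-d"}))
}
```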
The second one could be solved by special "replica Raft groups".
- When storage-a starts, it starts a "replica Raft group" in which it's the only voting member and storage-b and storage-c are non-voting members. It's also a non-voting member of the replica Raft groups hosted by storage-d and storage-e.
- When a repository (repository-1 for example) is created in storage-a, storage-a starts a repository Raft group in which storage-a is the initial member. It then writes into the replica Raft group. storage-b and storage-c pick up the new log, persist it, and start corresponding repository Raft groups. repository-1's Raft group has three voting members afterward: storage-a, storage-b, storage-c.
- If storage-a goes down, its replica Raft group stops functioning. That's fine because storage-a isn't capable of creating a new repository anyway. The quorum of repository-1's Raft group is maintained. Hence, it starts a new election term and moves its leadership to either storage-b or storage-c.
- When storage-a comes back, it starts all repository Raft groups for the repositories recorded in the replica Raft groups of which it's a member. These are the repositories that storage-a needs to maintain. If storage-d or storage-e created new repositories while it was away, storage-a becomes aware of them after it fetches the newest log entries.
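The creation step in the list above can be modeled as follows. This is a toy model, not the implementation: the `storage` type and `createRepository` helper are assumptions, and the real replica Raft groups would run on dragonboat with the non-voting members applying the replicated log entry rather than being called directly:

```go
package main

import "fmt"

// storage is a toy model of a node: it tracks which repository Raft groups
// it participates in.
type storage struct {
	name       string
	repoGroups map[string]bool
}

func newStorage(name string) *storage {
	return &storage{name: name, repoGroups: make(map[string]bool)}
}

// createRepository models appending a "create repository" entry to the
// owner's replica Raft group log: the owner and every non-voting replica
// apply the entry by starting the corresponding repository Raft group.
func createRepository(repo string, owner *storage, replicas []*storage) {
	for _, s := range append([]*storage{owner}, replicas...) {
		s.repoGroups[repo] = true // apply the replicated log entry
	}
}

func main() {
	a, b, c := newStorage("storage-a"), newStorage("storage-b"), newStorage("storage-c")
	createRepository("repository-1", a, []*storage{b, c})
	// prints: true true true — repository-1's group has three members afterward
	fmt.Println(a.repoGroups["repository-1"], b.repoGroups["repository-1"], c.repoGroups["repository-1"])
}
```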
As in #6104, we reason about keeping relative paths as the SSOT. However, as dragonboat doesn't allow using a string as the ShardID, we need to maintain a hashmap between {relative_path: shard_id}.
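A minimal sketch of that mapping, assuming a monotonically increasing uint64 allocator (dragonboat identifies shards by uint64). This is in-memory only for illustration; the real mapping must be persisted so that shard IDs stay stable across restarts. The type name, methods, and example paths are all hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// shardRegistry maintains the {relative_path: shard_id} mapping: each
// repository's relative path is assigned a uint64 ShardID on first use.
type shardRegistry struct {
	mu     sync.Mutex
	byPath map[string]uint64
	nextID uint64
}

func newShardRegistry() *shardRegistry {
	return &shardRegistry{byPath: make(map[string]uint64), nextID: 1}
}

// shardID returns the ShardID for a relative path, allocating a new one
// if the path hasn't been seen before.
func (r *shardRegistry) shardID(relativePath string) uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	if id, ok := r.byPath[relativePath]; ok {
		return id
	}
	id := r.nextID
	r.nextID++
	r.byPath[relativePath] = id
	return id
}

func main() {
	reg := newShardRegistry()
	fmt.Println(reg.shardID("@hashed/aa/bb/repo-1.git")) // 1
	fmt.Println(reg.shardID("@hashed/cc/dd/repo-2.git")) // 2
	fmt.Println(reg.shardID("@hashed/aa/bb/repo-1.git")) // 1 (stable on lookup)
}
```

A counter-based allocation is used here rather than hashing the path, since hashing a string down to a uint64 would admit collisions between distinct repositories.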