Move a group to a new shard

POC

Problem to solve

We know that we need to move top-level namespaces to new shards, either for rebalancing or to move them off the current database. This POC should show how it would be possible to move a single group to a new database.

Solution

Use Postgres logical replication to replicate a group (top-level namespace) to a new shard.

The preferred algorithm is described in #329308 (closed) . This issue should focus on the Postgres parts of that algorithm. The biggest risk I see is features that are missing in Postgres, so we should de-risk by trying to implement the Postgres parts quickly.

Technical steps

This algorithm is almost entirely described (but for MySQL) in https://www.usenix.org/conference/srecon19emea/presentation/li . Postgres specifics were discussed in this call https://youtu.be/0GtMDSKMCd4 and most details are described in https://paquier.xyz/postgresql-2/postgres-9-5-feature-highlight-pg-dump-snapshots/ .

The logic for moving group-1 from shard-0 to shard-1 will be:

  1. GitLab starts a GroupShardMoveWorker
  2. Configure Postgres to replicate all data belonging to group-1 from shard-0 to shard-1
  1. Such data will be WHERE namespace_id IN (...ids of group-1 and all of its subgroups) OR project_id IN (...ids of all projects in group-1 and its subgroups) (see universal sharding IDs)
  3. Postgres will do an initial copy of all data plus a stream of all updates
    1. Following https://paquier.xyz/postgresql-2/postgres-9-5-feature-highlight-pg-dump-snapshots/ our initial copy will need to first create a logical replication slot and get a snapshot ID
    2. The snapshot ID will then need to be used to generate the initial data copy, taking care not to lose the initial connection before the new transaction using this snapshot ID has started
    3. The copy can use any SELECT or pg_dump or any other normal Postgres queries as it is just in a normal Postgres transaction
  4. GitLab waits until the stream is almost caught up (some threshold of 1s lag should be fine)
  5. GroupShardMoveWorker acquires an exclusive lock for group-1 (see Shared/Exclusive locking)
  6. GroupShardMoveWorker waits until the stream of updates is empty (writing is paused so this should take around 1s)
  7. (optional but advisable) GroupShardMoveWorker does a validation check (checksum of all rows/columns into a single number to compare shard-0 with shard-1)
    1. As described in the linked presentation, the most likely cause of validation failure will be schema changes that happened during the move. Given the frequency of GitLab deployments and migrations, we may want an automated way to coordinate such that we never try moving a group while migrations are in progress. Alternatively, we abort a failed move and just retry it later.
  8. Sharding tables are updated to reflect group-1 now belongs to shard-1
  9. GroupShardMoveWorker releases exclusive lock for group-1
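Step 2.1 above can be sketched as follows. This is an illustrative Python helper (the function names are hypothetical, not part of GitLab) that builds the row-filter predicate and a matching publication statement. Note that row-filtered publications (FOR TABLE ... WHERE) only exist in PostgreSQL 15+; on earlier versions we would need to filter rows on the subscriber side or via a custom output-plugin consumer.

```python
def shard_move_filter(namespace_ids, project_ids):
    """Build the row filter selecting all of group-1's data (step 2.1).

    namespace_ids: ids of group-1 and all of its subgroups
    project_ids:   ids of all projects in group-1 and its subgroups
    """
    ns = ", ".join(str(i) for i in sorted(namespace_ids))
    pr = ", ".join(str(i) for i in sorted(project_ids))
    return f"namespace_id IN ({ns}) OR project_id IN ({pr})"

def create_publication_sql(table, predicate, name="group_move_pub"):
    """Row-filtered publication for one table (PostgreSQL 15+ syntax)."""
    return f"CREATE PUBLICATION {name} FOR TABLE {table} WHERE ({predicate});"
```

For example, `create_publication_sql("issues", shard_move_filter([1, 2], [10]))` yields a publication that replicates only the rows belonging to the group being moved.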
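Steps 3.1–3.3 (the snapshot-consistent initial copy from the Paquier post) can be sketched as an ordered command sequence. This is a sketch only: the slot name is hypothetical, the actual snapshot name is returned by the slot-creation command at runtime, and the CREATE_REPLICATION_SLOT command must be issued over a replication-protocol connection that stays open until the copy transaction has begun.

```python
def snapshot_copy_commands(slot="group_move_slot",
                           tables=("namespaces", "projects"),
                           predicate="namespace_id IN (1)"):
    """Ordered commands for a snapshot-consistent initial copy (steps 3.1-3.3)."""
    cmds = [
        # 3.1: on a replication connection; returns a snapshot name.
        f"CREATE_REPLICATION_SLOT {slot} LOGICAL pgoutput EXPORT_SNAPSHOT;",
        # 3.2: on a second, ordinary connection, before the first one closes.
        "BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;",
        "SET TRANSACTION SNAPSHOT '<snapshot-name-from-slot-creation>';",
    ]
    # 3.3: any normal SELECT/COPY works, since this is a normal transaction.
    for t in tables:
        cmds.append(f"COPY (SELECT * FROM {t} WHERE {predicate}) TO STDOUT;")
    cmds.append("COMMIT;")
    return cmds
```

Equivalently, the copy could be done with pg_dump --snapshot=<name> instead of explicit COPY statements.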
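For steps 4 and 6, "caught up" can be measured by comparing the primary's current WAL position (pg_current_wal_lsn() on shard-0) with the replay position reported for the subscription on shard-1. A minimal sketch of the LSN arithmetic (the function names are ours, not Postgres APIs):

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN string like '0/16B3748' to a byte offset.

    An LSN is two hex numbers: the high 32 bits and the low 32 bits,
    separated by a slash.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def replication_lag_bytes(current_lsn, replay_lsn):
    """Bytes of WAL the subscriber still has to replay.

    current_lsn: result of pg_current_wal_lsn() on shard-0
    replay_lsn:  replayed LSN for the subscription on shard-1
    """
    return lsn_to_int(current_lsn) - lsn_to_int(replay_lsn)
```

GroupShardMoveWorker would poll this until the lag falls under the threshold (step 4), and after taking the exclusive lock, until it reaches zero (step 6).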
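The validation check in step 7 needs a checksum that both shards can compute independently and that does not depend on scan order. One possible approach (a sketch, not the decided implementation) is to XOR per-row digests together, assuming both shards serialise rows identically:

```python
import hashlib

def table_checksum(rows):
    """Order-independent checksum of a table's rows (step 7).

    rows: iterable of tuples, serialised the same way on both shards.
    XOR-combining per-row digests makes the result independent of the
    order in which rows are scanned.
    """
    acc = 0
    for row in rows:
        digest = hashlib.md5(repr(row).encode()).digest()
        acc ^= int.from_bytes(digest, "big")
    return acc
```

The move is validated when table_checksum over the filtered rows on shard-0 equals the same checksum on shard-1; a mismatch (e.g. from a concurrent schema migration, per 7.1) would abort the move for a later retry.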

Resources

  1. https://www.usenix.org/conference/srecon19emea/presentation/li
  2. https://paquier.xyz/postgresql-2/postgres-9-5-feature-highlight-pg-dump-snapshots/
    1. Video with similar content https://www.youtube.com/watch?v=gPqMUfYFBLs but earlier
  3. Understanding Logical Decoding and Replication - Michael Paquier
  4. Discuss Postgres Replication For Moving Between Shards - @DylanGriffith/@NikolayS

Confidence of due date May 12 => 80%
