Identify how to (logically) replicate data to new shards, and when they are in sync

Regardless of the type of sharding solution we pick, we need a way of replicating data from the old database to the new database(s). This likely requires logical replication, as we only want to replicate a subset of our data. That brings the challenges of determining when the hosts are in sync and of doing a transparent switch to the new host. In this issue we should discuss these challenges and the approaches we may take.

For now I've set the due date to Friday, but I suspect we may need more time than that.

Proposal

There are likely 2 paths here, with some overlapping details:

  1. Using Postgres logical replication/decoding to sync the data => Preferred
  2. Using the application (i.e. Rails code) to sync the data

The engineering team agrees that option (1) best meets our needs for simplicity, performance, and reliability. The details are as follows.

(1) Postgres logical replication/decoding => Preferred

This algorithm is almost entirely described (but for MySQL) in https://www.usenix.org/conference/srecon19emea/presentation/li . Postgres specifics will be determined in #329977 (closed) but we believe it's all feasible.

Use Postgres logical replication/decoding. The logic for moving group-1 from shard-0 to shard-1 will be:

  1. GitLab starts a GroupShardMoveWorker
  2. Configure Postgres to replicate all data belonging to group-1 from shard-0 to shard-1
    1. Such data will be WHERE namespace_id IN (...all subgroups of group-1 and group-1 ids) OR project_id IN (...all projects in group-1 and its subgroups ids) (see universal sharding IDs)
  3. Postgres will do an initial copy of all data plus a stream of all updates
    1. Following https://paquier.xyz/postgresql-2/postgres-9-5-feature-highlight-pg-dump-snapshots/ our initial copy will need to first create a logical replication slot and get a snapshot ID
    2. The snapshot ID will then need to be used to generate the initial data copy, taking care not to lose the initial connection (which holds the snapshot open) before starting the new transaction with this snapshot ID
    3. The copy can use any SELECT or pg_dump or any other normal Postgres queries as it is just in a normal Postgres transaction
  4. GitLab waits until the stream is almost caught up (some threshold of 1s lag should be fine)
  5. GroupShardMoveWorker acquires an exclusive lock for group-1 (see Shared/Exclusive locking)
  6. GroupShardMoveWorker waits until the stream of updates is empty (writing is paused so this should take around 1s)
  7. (optional but advisable) GroupShardMoveWorker does a validation check (checksumming all rows/columns into a single number to compare shard-0 with shard-1)
    1. As described in the linked presentation the most likely cause of validation failure will be schema changes that happened during the move. Given the frequency of GitLab deployments and migrations we may want an automated way to coordinate such that we never try moving a group while migrations are in progress. Or we abort a failed move and just retry it later.
  8. Sharding tables are updated to reflect group-1 now belongs to shard-1
  9. GroupShardMoveWorker releases exclusive lock for group-1
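The steps above can be sketched as an in-memory walk-through. This is a simulation only: shards are plain hashes of id => value, the replication stream is an array, and locking is elided. All names and data here are hypothetical; real code would drive Postgres replication slots.

```ruby
# Step 7 validation: reduce all rows to a single comparable value.
def checksum(rows)
  rows.sort.map { |id, v| "#{id}=#{v}" }.join("|")
end

source  = { 1 => "issue-a", 2 => "issue-b" }  # group-1's data on shard-0
routing = { "group-1" => "shard-0" }

# Steps 3.1-3.2: the snapshot fixes what the initial copy will see;
# writes arriving afterwards land on shard-0 AND in the update stream.
snapshot = source.dup
stream   = []
source[2] = "issue-b (edited)"; stream << [2, "issue-b (edited)"]
source[3] = "issue-c";          stream << [3, "issue-c"]

# Step 3: initial copy to shard-1, then steps 4-6: drain the stream
# (in reality under the exclusive lock once lag is below ~1s).
target = snapshot.dup
stream.each { |id, v| v.nil? ? target.delete(id) : (target[id] = v) }

# Step 7: validate; step 8: flip the routing table to shard-1.
raise "validation failed" unless checksum(source) == checksum(target)
routing["group-1"] = "shard-1"
```

The key property the simulation shows is that the snapshot copy plus an in-order replay of the stream converges shard-1 to exactly shard-0's state before the routing flip.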

Pros

  1. Postgres replication is likely the fastest/cheapest/most reliable way to sync the databases
  2. Postgres replication skips any Rails code (e.g. validations/hooks) that might update the data and therefore make it impossible to validate that the move is correct

Cons

  1. There will be a small write downtime for a group due to the exclusive lock (it should be under a few seconds, and our backend could retry any writes issued during this window, so users would see roughly 3s of extra latency rather than failures)
  2. We are limited to the features available in Postgres and may therefore need to add extra columns (denormalized sharding IDs) or other data to make this work

(2) Using the application (i.e. Rails code) to sync the data

The specifics here are left mostly vague, since we believe this is neither a good nor an easy option.

The easiest way to do this in Rails would be pretty much the same as described for Postgres logical replication/decoding above, except we load all the data in Rails and write it to shard-1, then create a queue of updates that started after the initial load and replay them to shard-1. These updates need to be replayed in order after the initial load, so they must be queued; otherwise we'll try updating/deleting something that does not exist.
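A minimal sketch of this approach, with in-memory hashes standing in for the shards (all names are hypothetical): writes that land after the snapshot go to shard-0 and into a queue, and the queue is replayed in order onto shard-1 after the initial load.

```ruby
shard0   = { 10 => "note-a" }
snapshot = shard0.dup          # the "initial load" sees this state

# Every write after the snapshot is applied to shard-0 AND queued.
queue  = []
record = lambda do |op, id, value = nil|
  case op
  when :upsert then shard0[id] = value
  when :delete then shard0.delete(id)
  end
  queue << [op, id, value]
end

record.call(:upsert, 11, "note-b")
record.call(:upsert, 11, "note-b v2")  # must replay AFTER the first upsert
record.call(:delete, 10)               # must replay AFTER the initial load

# Initial load onto shard-1, then in-order replay of the queue.
shard1 = snapshot.dup
queue.each do |op, id, value|
  op == :delete ? shard1.delete(id) : (shard1[id] = value)
end
```

Replaying out of order (e.g. the delete of row 10 before the load, or the two upserts of row 11 swapped) would leave shard-1 diverged, which is why a real implementation needs an ordered, durable queue.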

Pros

  1. It's in Rails code so we can be more flexible with how our data is modelled
  2. There are theoretical ways to do dual writing that require no downtime at all but are possibly more complicated to implement

Cons

  1. It will be very slow and very CPU-intensive (marshalling/unmarshalling via Ruby is hard on the CPU)
  2. There is basically no way to validate the move was successful, as any number of Rails hooks or validations may change the resulting data, and we can't tell whether a difference is actually a bug or just a minor discrepancy
  3. The "queue" mechanism described for buffering and replaying updates that happen during initial load does not exist and it would be expensive to implement and likely error prone given the complexity involved in all the data models being transformed here
  4. To make the solution correct we would likely have to extensively fix many Rails hooks that update legacy data we don't want updated, in ways that create a poor UX. We also have almost no way to detect these ahead of time, as there may be no way to properly validate that the update is complete
  5. This introduces an extra pattern for writing Rails data that bypasses our normal sharding mechanisms (since we are writing to a different shard), and thus more complex Rails code

(3) Using Import/Export features of GitLab

There was also a discussion about using Import/Export to do the move. We haven't explored this in detail. It could theoretically work, but it would likely suffer from many of the problems of doing this in Rails code, plus some more, so we are ruling it out for now. We could always consider it as a backup if all else fails.

Cons

  1. Import/Export creates copies of each record, meaning each will get a newly generated DB ID. New DB IDs mean that any API callers expecting stable IDs before/after a migration will break
  2. Import/Export was never designed to copy everything, and as such it skips many important records that a user would not want to lose: for example build logs, artifacts, and container registry images
  3. Aside from creating new DB IDs, it also only works for a newly created group with a new path, so this would be even less seamless for users
  4. Import/Export already does not scale to large groups: some take many hours to migrate, and many hours of downtime would not be acceptable. Given its design, Import/Export would also be much slower than other Ruby-based approaches

We could overcome all of the above by making changes to how Import/Export works, but it would end up being the same as option (2), suffer from all of its flaws, and probably be more work than option (1).

(4) Lots of write downtime moving with Ruby or Postgres whichever is easier

The problem becomes quite a bit simpler if we're willing to accept a few hours of write downtime when moving a group. We don't need to worry about logical replication, transactions, and so on. The algorithm is simply to:

  1. pause writes to a specific group (see Shared/Exclusive locking)
  2. Copy all the data for the group across => the time taken may vary from minutes to hours for the largest groups
    1. We'd definitely want to avoid using Ruby models here as they are slow; in general we should do as little in Ruby as possible to speed this up
  3. Change the sharding location of the group
  4. Unpause writes
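Because writes are paused for the whole copy, no stream or queue is needed, so the sketch is trivial (names are hypothetical, shards again modelled as hashes):

```ruby
routing = { "group-1" => "shard-0" }
shard0  = { 1 => "row-a", 2 => "row-b" }
shard1  = {}

writes_paused = true               # 1. pause writes (exclusive lock)
shard1.merge!(shard0)              # 2. copy all the group's data across
routing["group-1"] = "shard-1"     # 3. change the sharding location
writes_paused = false              # 4. unpause writes
```

The trade-off versus option (1) is that step 2 here runs for the entire copy duration (minutes to hours) while writes are blocked, rather than only for the final ~1s drain.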

Universal sharding IDs

Per the proposed solution above, we need a single, database-level way of identifying all data that belongs to a top-level group. The ideal solution (and the most common one for sharded databases) is for every record to embed the sharding key, in our case top_level_namespace_id. Since we have hundreds of tables, it may be difficult to embed top_level_namespace_id in every record.

An alternative proposal is a rule that every sharded table has at least a project_id or a namespace_id, which gives us at most 2 ways to query for every record that belongs to the group. Most records today do have a project_id, but some project objects are nested child objects that do not (e.g. award_emoji). Storing these IDs on deeply nested child objects is called "de-normalization". We should explore whether it's feasible to de-normalize project_id onto everything (likely easiest), or even better top_level_namespace_id, which would make a lot of sharding problems much simpler.
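Under the "at most 2 ways to query" rule, selecting all of a group's rows in any table reduces to one OR of two IN lists. A hypothetical helper (the function name and IDs are illustrative only):

```ruby
# Builds the row filter for a group given its namespace tree and
# projects; every sharded table carries one of these two columns.
def group_filter_sql(namespace_ids:, project_ids:)
  "WHERE namespace_id IN (#{namespace_ids.join(', ')}) " \
  "OR project_id IN (#{project_ids.join(', ')})"
end

sql = group_filter_sql(namespace_ids: [1, 5, 9], project_ids: [20, 21])
# => "WHERE namespace_id IN (1, 5, 9) OR project_id IN (20, 21)"
```

If top_level_namespace_id were denormalized onto every record instead, the filter would collapse to a single equality predicate, which is why that option would make sharding problems much simpler.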

Some related topics we can/should explore later:

  1. #330304 (closed)
  2. #330305 (closed)

Shared/Exclusive locking

This will be worked on further in #329976 (closed)

The algorithm described in https://www.usenix.org/conference/srecon19emea/presentation/li is useful for the kind of locking we need to block writes for a specific group.

As such there will be 2 types of locks for a group:

  1. Shared lock:
    1. This will be acquired by any process that needs a "write" connection to a shard for a specific group
    2. There can be multiple shared locks at a time
    3. A process must release a shared lock once it's done writing
    4. A process must not accidentally hold a shared lock forever if it crashes
  2. Exclusive lock:
    1. The process which moves a group will need an exclusive lock for that group for a brief period of time
    2. An exclusive lock can only be obtained when all shared locks are released
    3. Once an exclusive lock is obtained for a group you can no longer acquire any shared locks for this group
    4. The move worker should not accidentally hold the exclusive lock forever if it crashes
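The rules above are exactly those of a reader-writer lock. A non-blocking, single-process sketch (the real implementation would live in Redis or a Postgres lock table; class and method names are hypothetical):

```ruby
class GroupLock
  def initialize
    @shared    = 0      # count of current shared holders
    @exclusive = false  # whether the move worker holds the lock
  end

  # Rules 1.2 and 2.3: many shared holders allowed, but none while
  # the exclusive lock is held.
  def try_shared
    return false if @exclusive
    @shared += 1
    true
  end

  # Rule 1.3: holders must release once done writing.
  def release_shared
    @shared -= 1 if @shared > 0
  end

  # Rule 2.2: exclusive only when all shared locks are released.
  def try_exclusive
    return false if @exclusive || @shared > 0
    @exclusive = true
    true
  end

  def release_exclusive
    @exclusive = false
  end
end
```

Rules 1.4 and 2.4 (not holding a lock forever after a crash) are deliberately not modelled here; they are a property of the backing store (timeouts, heartbeats) rather than of the lock logic itself.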

We will need to choose an appropriate technology to keep track of these locks in our distributed system.

We will evaluate at least the following options:

  1. Redis https://redis.io/topics/distlock
  2. Postgres dedicated lock table #329308 (comment 568878329)
    1. Per the above thread we need to ensure that Postgres locking is not used on our main database with its massive amounts of writes, as it will lead to huge numbers of dead tuples that cannot be cleaned up until the lock is freed. We could simply move this locking to a dedicated logical database to mitigate the problem.

From what I can tell, neither of these will give us a guarantee of no writes happening (i.e. like a proper read-only database would), but no such option seems possible given the specifics of our domain. Even the Postgres lock option does not give this guarantee, since the dedicated lock table is not the table people will be writing to.

We should evaluate based on:

  1. We should build as much into our data access layers/verification/tooling as possible to make it very difficult for an application developer to forget to obtain the lock. Any mistakes in skipping the lock could lead to data loss/corruption.
  2. We need incredibly low latency, since the lock will be requested for every action in GitLab that writes to the database (e.g. a dedicated Redis instance)
  3. We want it to be very difficult/impossible for a lock to be stuck forever if a process crashes after obtaining it. A usual pattern we have with Redis is lock timeouts, but timeouts can be wrong too if the process happens to take longer than the timeout, so we may need something more sophisticated like heartbeats or persistent TCP connections to hold a lock.
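The heartbeat idea can be sketched as a lease: the holder renews a short TTL while alive, so a crashed holder simply stops renewing and the lock frees itself, while a slow-but-alive holder never loses it. This is a sketch under the assumption of an injectable clock; all names are hypothetical.

```ruby
class Lease
  TTL = 10 # seconds; short, because heartbeats renew it continuously

  def initialize(clock)
    @clock      = clock  # callable returning the current time
    @expires_at = nil
  end

  def acquire
    return false if held?
    @expires_at = @clock.call + TTL
    true
  end

  # A live holder calls this periodically, well before TTL elapses.
  def heartbeat
    @expires_at = @clock.call + TTL if held?
  end

  def held?
    @expires_at && @clock.call < @expires_at
  end
end

now   = 0
lease = Lease.new(-> { now })
lease.acquire
now = 8;  lease.heartbeat   # healthy process renews: expires at t=18
now = 15                    # still held thanks to the renewal
now = 30                    # process crashed at t=8; lease has expired
```

Unlike a single long timeout, the TTL here never needs to exceed the longest possible write, only the heartbeat interval, which sidesteps the "process takes longer than the timeout" failure mode.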

Confidence of due date May 7 => 100%

Edited by Dylan Griffith (ex GitLab)