Repository storage moves should replicate object pools

When a repository is moved to a different storage, its linked object pool is neither recreated on the target storage nor linked to the moved repository. Consequently, the object pool relationships stored in GitLab Rails are also dropped (code). Object pools and their relationships should be preserved and recreated across storage moves to maintain the storage benefits of object deduplication.

Something to keep in mind is that preserving the project's object pool relationship is not as simple as moving the object pool and updating its assigned storage in the pool_repositories table. Object pools can be shared between multiple repositories. Moving a repository that uses an object pool to a new storage does not mean that the object pool is no longer needed on the source storage, because other repositories on the source storage could still be using it. This means the same object pool with the same relative path must be capable of existing on multiple storages. This is currently not possible due to the following unique index on the pool_repositories table:

CREATE UNIQUE INDEX index_pool_repositories_on_disk_path ON pool_repositories USING btree (disk_path);

In !124791 (merged), the index is replaced with the following index, which covers both disk_path and shard_id and therefore allows an object pool to exist on multiple storages:

CREATE UNIQUE INDEX unique_pool_repositories_on_disk_path_and_shard_id ON pool_repositories USING btree (disk_path, shard_id);
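
The effect of the index change can be illustrated with a small sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL (so the `USING btree` clause is dropped), a heavily simplified pool_repositories table, and a made-up pool path; the helper names `make_db` and `try_insert` are hypothetical and exist only for this illustration.

```python
import sqlite3

# Index definitions from the issue, minus "USING btree" (not SQLite syntax).
OLD_INDEX = (
    "CREATE UNIQUE INDEX index_pool_repositories_on_disk_path "
    "ON pool_repositories (disk_path)"
)
NEW_INDEX = (
    "CREATE UNIQUE INDEX unique_pool_repositories_on_disk_path_and_shard_id "
    "ON pool_repositories (disk_path, shard_id)"
)

# Hypothetical relative path, used only for this example.
POOL_PATH = "@pools/aa/bb/example"


def make_db(index_sql):
    """Create a simplified pool_repositories table with the given unique index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pool_repositories ("
        "id INTEGER PRIMARY KEY, shard_id INTEGER NOT NULL, disk_path TEXT)"
    )
    conn.execute(index_sql)
    return conn


def try_insert(conn, shard_id, disk_path):
    """Return True if the row was inserted, False if a unique index rejected it."""
    try:
        conn.execute(
            "INSERT INTO pool_repositories (shard_id, disk_path) VALUES (?, ?)",
            (shard_id, disk_path),
        )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    for name, index_sql in (("old", OLD_INDEX), ("new", NEW_INDEX)):
        conn = make_db(index_sql)
        first = try_insert(conn, 1, POOL_PATH)   # pool on shard 1
        second = try_insert(conn, 2, POOL_PATH)  # same pool path on shard 2
        print(name, "index: shard 1 ->", first, ", shard 2 ->", second)
```

With the old index, the second insert is rejected because disk_path alone must be unique. With the new index, the same disk_path is accepted once per shard, and only a duplicate on the same shard is rejected.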

When a repository storage move is performed in Rails and the repository is linked to an object pool, the required object pool should be replicated to the target storage if it does not already exist there, and the repository should be linked to it. To facilitate this, Rails can coordinate additional Gitaly RPCs to perform the object pool replication and linking.

Steps to perform these operations:

1. Invoke ReplicateRepository() on the source repository to replicate it onto the target storage. (This is already done today.)
2. Consult the pool_repositories table to check whether the source repository is linked to an object pool. If it is not, no additional steps are required and the storage move proceeds as it does today.
3. If the repository is linked to an object pool, check whether the pool_repositories table already has an entry for the object pool on the target storage. If the required object pool already exists on the target storage, there is no need to replicate it again, which saves time.
4. If the required object pool does not exist on the target storage, invoke ReplicateRepository() a second time, this time to replicate the source repository's object pool onto the target storage. Once replication is complete, a new entry must be created in the pool_repositories table to track this object pool's existence.
5. Now that both the repository and its required object pool exist on the target storage, invoke LinkRepositoryToObjectPool() to connect the two repositories. Rails currently manages the lifecycle of object pools, so this new relationship must also be tracked. Otherwise, the object pool could be cleaned up prematurely, leading to repository corruption.
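
The steps above can be sketched as follows. This is a minimal illustration and not the Rails implementation: `FakeGitaly` only records which RPCs would be invoked, `PoolTable` is an in-memory stand-in for the pool_repositories table and the repository-to-pool links, and all class, method, and parameter names are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeGitaly:
    """Hypothetical client that records RPC calls instead of sending them."""
    calls: list = field(default_factory=list)

    def replicate_repository(self, source, target, relative_path):
        self.calls.append(("ReplicateRepository", source, target, relative_path))

    def link_repository_to_object_pool(self, storage, repo_path, pool_path):
        self.calls.append(("LinkRepositoryToObjectPool", storage, repo_path, pool_path))


@dataclass
class PoolTable:
    """In-memory stand-in for pool_repositories and the repository links."""
    pools: set = field(default_factory=set)    # {(storage, pool_path)}
    links: dict = field(default_factory=dict)  # repo_path -> (storage, pool_path)


def move_repository(gitaly, table, repo_path, source, target):
    # Replicate the repository itself onto the target storage.
    gitaly.replicate_repository(source, target, repo_path)

    # Check whether the repository is linked to an object pool at all.
    link = table.links.get(repo_path)
    if link is None:
        return  # no pool, nothing more to do

    _, pool_path = link

    # Replicate the pool only if the target storage does not have it yet,
    # then record the new pool entry.
    if (target, pool_path) not in table.pools:
        gitaly.replicate_repository(source, target, pool_path)
        table.pools.add((target, pool_path))

    # Link the moved repository to the pool on the target storage and
    # track the relationship so the pool is not cleaned up prematurely.
    gitaly.link_repository_to_object_pool(target, repo_path, pool_path)
    table.links[repo_path] = (target, pool_path)
```

Note that the pool entry for the source storage is left in place, since other repositories on that storage may still depend on it. Moving a second member of the same pool to the same target skips the pool replication and only performs the link.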

These steps should result in proper replication and linking of object pools for both Gitaly and Praefect deployments.

Something else to consider is the impact that object pools existing on multiple storages will have on the FetchIntoObjectPool() RPC. This RPC is responsible for updating an object pool by fetching objects from the primary object pool member. Once object pools are capable of existing on multiple storages, there is no longer a guarantee that the object pool and its primary member will exist on the same storage. This is problematic for FetchIntoObjectPool() because it requires the repositories to coexist on the same storage. Consequently, object pools on storages without their primary member will be frozen and unable to be updated. This behavior is fine for the time being, but if the primary object pool member is moved to a new storage, it is important that the target object pool begins receiving updates via FetchIntoObjectPool(). Otherwise, storage moves of primary object pool members would freeze the object pool indefinitely and prevent future object deduplication.

An object pool entry in the pool_repositories table uses the source_repository_id field to specify its primary object pool member. When the primary object pool member moves storages, the corresponding object pool entry in pool_repositories must have the correct source_repository_id.
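
A minimal sketch of this bookkeeping, under stated assumptions: `PoolEntry` is a simplified in-memory model of a pool_repositories row, and `can_fetch_into_object_pool` and `record_member_move` are hypothetical helpers, not existing Rails or Gitaly functions. The sketch leaves the source storage's pool entry untouched, which is one possible choice and not something the issue prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolEntry:
    """Simplified model of one pool_repositories row (illustration only)."""
    storage: str
    disk_path: str
    source_repository_id: Optional[int]


def can_fetch_into_object_pool(pool, repo_storage_by_id):
    """A pool can be updated only if its primary member is on the same storage."""
    src = pool.source_repository_id
    return src is not None and repo_storage_by_id.get(src) == pool.storage


def record_member_move(pools, repo_storage_by_id, repo_id, target_storage, disk_path):
    """Record a storage move and re-point the target pool if the primary member moved."""
    was_primary = any(
        p.disk_path == disk_path and p.source_repository_id == repo_id
        for p in pools
    )
    repo_storage_by_id[repo_id] = target_storage
    if not was_primary:
        return  # non-primary members do not affect source_repository_id
    for p in pools:
        if p.storage == target_storage and p.disk_path == disk_path:
            # The target pool now follows the moved primary member.
            p.source_repository_id = repo_id
```

In this model, the pool entry left behind on the source storage still names the moved repository, so `can_fetch_into_object_pool` reports it as frozen, matching the behavior described above for pools without a local primary member.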

/cc @andrashorvath @proglottis

Availability and Testing

Regression testing: please ensure the associated MR is labelled with pipeline:run-all-e2e and the e2e:package-and-test job is passing.

Edited Jul 18, 2023 by Jay McCure