UpdateRepositoryStorageService should clean up after itself upon encountering a failure

Problem to solve

As a GitLab admin, I want the replica repository left behind by a failed ReplicateRepository gRPC method invocation to be cleaned up, so that I can re-attempt the replication without having to manually delete the failed replica repository directory from the file system of its Gitaly storage shard.

I have learned that Gitaly is not the appropriate place for this implementation. Instead, the UpdateRepositoryStorageService should be responsible for cleaning up after it detects a failure from its invocation of the ReplicateRepository gRPC method, here: https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/projects/update_repository_storage_service.rb#L84

The problem to be solved is that when a failure occurs, the failed replica repository is left behind on the file system of the destination storage shard, and no further attempts may be made, since the repository directory already exists there. At best, Gitaly will attempt to set up a second remote configuration and fetch from it into the already existing repository. At worst, an admin user may give up on the replication or assume it was successful, and the failed replica repository will remain dormant and unusable on the file system of the destination Gitaly shard indefinitely.

Intended users

A user with the Sidney persona will use this feature.

User experience goal

The user should be able to perform a repository replication (move/migration) to another shard without fear that the replica repository will be left in a broken or inconsistent state.

Proposal

Detect all failures from the invocation of the Repository#replicate method, and delete all artifacts that the service-level procedure created on the file system of the destination Gitaly shard.

This clean-up action presents zero risk of customer data loss, since the original repository is never modified (only read) during the replication procedure.
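
As a rough illustration of the proposed behavior (not the actual service code), the clean-up could be sketched as below. `FakeShard`, `ReplicationError`, and `replicate_with_cleanup` are hypothetical names invented for this example and are not GitLab or Gitaly classes. Returning a boolean instead of re-raising is also an assumption made to keep the sketch short.

```ruby
# Hypothetical in-memory stand-in for the destination storage shard.
# It is NOT a GitLab or Gitaly class; it only tracks which paths exist.
class FakeShard
  def initialize
    @repos = {}
  end

  def exists?(disk_path)
    @repos.key?(disk_path)
  end

  def create(disk_path)
    @repos[disk_path] = true
  end

  def remove(disk_path)
    @repos.delete(disk_path)
  end
end

# Hypothetical error type standing in for a replication failure.
ReplicationError = Class.new(StandardError)

# Runs the replication step given as a block. On any StandardError the
# replica at disk_path is removed from the destination shard, so that a
# later attempt starts from a clean destination. Only the destination is
# touched; the source repository is never modified.
def replicate_with_cleanup(shard, disk_path)
  yield
  true
rescue StandardError
  shard.remove(disk_path)
  false
end
```

A failing replication then leaves nothing behind on the destination:

```ruby
shard = FakeShard.new
ok = replicate_with_cleanup(shard, "example/path") do
  shard.create("example/path")            # partial replica gets created
  raise ReplicationError, "fetch failed"  # then the replication fails
end
# ok is false, and shard.exists?("example/path") is false
```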

Further details

Here is an example failure from which clean-up was required.

Permissions and Security

No new permissions are required. The operation continues to require what it requires today: an API access token with administrator-level access.

Documentation

Here is the current documentation of manual clean-up procedures.

Availability & Testing

TBD

What does success look like, and how can we measure that?

Success looks like having no Git repository at the /var/opt/gitlab/git-data/repositories/$disk_path.git path on the file system of the destination Gitaly shard after the UpdateRepositoryStorageService procedure encounters any failure.
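
One way this criterion might be checked on the destination shard is sketched below. The helper names `replica_path` and `replica_absent?` are hypothetical, and the storage root is passed in as a parameter because the default shown above may differ per installation.

```ruby
require "fileutils"
require "tmpdir"

# Builds the on-disk location of a repository under a storage root,
# following the <storage_root>/<disk_path>.git layout described above.
def replica_path(storage_root, disk_path)
  File.join(storage_root, "#{disk_path}.git")
end

# True when no repository directory exists at the expected location,
# which is the success condition after a failed replication.
def replica_absent?(storage_root, disk_path)
  !Dir.exist?(replica_path(storage_root, disk_path))
end
```

On a real shard the storage root would be /var/opt/gitlab/git-data/repositories, and the disk path would be that of the project whose move failed.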

What is the type of buyer?

All buyers/applicability uncertain.

Is this a cross-stage feature?

Uncertain/possibly Gitaly.

Links / references