UpdateRepositoryStorageService should clean up after itself upon encountering a failure
Problem to solve
As a GitLab admin, I want the artifacts of a failed ReplicateRepository
gRPC method invocation cleaned up automatically, so that I can retry the replication without having to delete the failed replica repository directory from the destination Gitaly storage shard's file system by hand.
As I have learned, Gitaly is not the appropriate place for such an implementation; rather, UpdateRepositoryStorageService
should be responsible for cleaning up after a failure raised by the invocation of the ReplicateRepository
gRPC method, here: https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/projects/update_repository_storage_service.rb#L84
The problem solved here is that when a failure occurs, the failed replica repository is left behind on the destination storage shard's file system, and no further attempts can succeed, because the repository directory already exists on the destination Gitaly shard. At best, Gitaly will attempt to configure a second remote and fetch from it into the already-existing repository; at worst, an admin may give up on the replication or assume it succeeded, and the failed replica repository will remain dormant and unusable on the destination Gitaly shard's file system indefinitely.
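For illustration, here is a simplified sketch of the failure-prone flow; the method and variable names are hypothetical stand-ins, and the real implementation is in the service linked above.

```ruby
# Sketch only; names are illustrative, not the actual service code.
def mirror_repository(project, destination_storage_name)
  source = project.repository.raw

  destination = Gitlab::Git::Repository.new(
    destination_storage_name,
    source.relative_path,
    source.gl_repository,
    source.full_path
  )

  # Invokes the ReplicateRepository gRPC method, which creates the repository
  # on the destination shard and fetches into it. If this raises, the
  # partially created repository directory stays behind and blocks retries.
  destination.replicate(source)
end
```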
Intended users
A user with the Sidney (Systems Administrator) persona will use this feature.
User experience goal
The user should be able to perform a repository replication (move/migration) to another shard without fear that the replica repository will be left in a broken or inconsistent state.
Proposal
Detect all failures raised by the invocation of the Repository#replicate
method, and delete all artifacts that the service-level procedure created on the destination Gitaly shard's file system.
This clean-up action presents zero risk of customer data loss, since the original repository is never modified (only read) during the replication procedure.
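A minimal sketch of the proposed clean-up, wrapping the replicate call and rolling back the destination on any failure (destination.remove below is a hypothetical stand-in for whatever actually deletes the replica):

```ruby
# Sketch only: roll back the destination replica when replication fails.
def mirror_repository(destination, source)
  destination.replicate(source)
rescue StandardError
  # Only the destination replica is deleted; the source repository is never
  # modified (only read) during replication, so no customer data is at risk.
  destination.remove
  raise
end
```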
Further details
Here is an example failure that required manual clean-up.
Permissions and Security
The permissions required are unchanged from today: an administrator-level API access token.
Documentation
Here is the current documentation of manual clean-up procedures.
Availability & Testing
TBD
What does success look like, and how can we measure that?
Success looks like no Git repository remaining at the /var/opt/gitlab/git-data/repositories/$disk_path.git
path on the destination Gitaly shard's file system whenever the UpdateRepositoryStorageService
procedure encounters a failure.
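One possible shape for verifying this in a regression test (RSpec; the repository_exists_on? helper and surrounding setup are hypothetical): force the replicate call to fail and assert that the destination shard is left clean.

```ruby
# Sketch only: helper names and setup are hypothetical.
describe Projects::UpdateRepositoryStorageService do
  it 'leaves no repository on the destination shard after a failure' do
    allow_any_instance_of(Gitlab::Git::Repository)
      .to receive(:replicate)
      .and_raise(GRPC::Internal, 'replication failed')

    expect { service.execute }.to raise_error(StandardError)

    # Hypothetical helper that checks the destination shard's file system
    # for the replica's disk path.
    expect(repository_exists_on?(destination_shard, "#{disk_path}.git"))
      .to be(false)
  end
end
```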
What is the type of buyer?
All buyers/applicability uncertain.
Is this a cross-stage feature?
Uncertain/possibly Gitaly.