Group path rename may fail partway through, leaving repositories on disk in an inconsistent state

Summary

When renaming a namespace that has legacy-storage projects spread across several Gitaly shards, a failure of any RenameNamespace RPC other than the first leaves the database out of sync with the Gitaly shards whose rename had already succeeded.

Steps to reproduce

Create a group hierarchy like this, using projects with legacy (i.e., not hashed) storage:
```
group/
  project1 # repository_storage: default
  project2 # repository_storage: broken
```
A simple way to "break" Gitaly for the second project is to create the broken/repositories/foo directory. This will cause the RenameNamespace RPC to fail on that shard.
Rename group to foo

The rename will fail, leaving the group in the database under its old name, group. The user will see an error saying the group could not be renamed.

What is the current bug behavior?

The repository for project2 will stay in its usual place (broken/repositories/group/project2.git).

However, assuming project1 was attempted (and succeeded) before project2, the repository for project1 will now be at default/repositories/foo/project1.git, which is the wrong place for it. Attempting to access /group/project1 in the web UI will fail. Over time, an empty repository may be created for it.

What is the expected correct behavior?

The group rename should be fully transactional. It's not OK to lose repositories if the rename fails partway through.

Output of checks

This bug happens on GitLab.com. Infrastructure issue investigating a particular instance: https://gitlab.com/gitlab-com/infrastructure/issues/4630#note_91287462 (confidential as it contains user details)

Possible fixes

Hashed storage resolves this problem, as there is no need to move the project repositories around at all.

A quick and dirty mitigation would be to track the renames that succeeded, and attempt to reverse them all if any one rename fails. This isn't perfect, but should solve most of the cases. If the Unicorn process dies partway through, or a previously-working Gitaly breaks on us, we could still be left in an inconsistent state.
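The track-and-reverse idea could look roughly like the sketch below. It is illustrative only: `FakeShard` and `rename_on_all_shards` are hypothetical names invented here, and the in-memory shard merely imitates a RenameNamespace call that fails when the destination already exists. It is not the code in legacy_namespace.rb, nor Gitaly's client API.

```ruby
# Hypothetical in-memory stand-in for one Gitaly shard (not a real GitLab class).
class FakeShard
  attr_reader :name, :namespaces

  def initialize(name, namespaces)
    @name = name
    @namespaces = namespaces.dup
  end

  # Imitates RenameNamespace: fails if the source is missing or the
  # destination directory already exists on this shard.
  def rename_namespace(from, to)
    return false if @namespaces.include?(to) || !@namespaces.include?(from)

    @namespaces.delete(from)
    @namespaces << to
    true
  end
end

# Renames on every shard in order, remembering which shards succeeded.
# On the first failure, tries to reverse the earlier renames (best effort).
def rename_on_all_shards(shards, from, to)
  succeeded = []

  shards.each do |shard|
    if shard.rename_namespace(from, to)
      succeeded << shard
    else
      # Undo in reverse order; keep any shard whose reversal also failed.
      not_reverted = succeeded.reverse.reject { |done| done.rename_namespace(to, from) }
      return { ok: false, failed_on: shard.name, not_reverted: not_reverted.map(&:name) }
    end
  end

  { ok: true, failed_on: nil, not_reverted: [] }
end

default_shard = FakeShard.new('default', ['group'])
broken_shard  = FakeShard.new('broken', ['group', 'foo'])
result = rename_on_all_shards([default_shard, broken_shard], 'group', 'foo')
```

With the reproduction setup above, the rename on broken fails and the rename on default is reversed, so both shards still hold group. The `not_reverted` list is where the remaining weakness shows up: if a reversal fails, or the process dies before the rollback loop runs, the shards are still left inconsistent.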

A more robust mitigation would be to implement some form of 2- or 3-phase commit protocol. This is a lot more difficult, but would allow us to prepare for the rename on all Gitaly shards, and be assured of falling back to the correct filesystem paths if the rename failed on any one of them.
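As a rough illustration of the two-phase shape, the sketch below has every shard vote in a prepare step before any path is touched. Everything in it is an assumption made for illustration: `Participant`, `prepare`, `commit`, `rollback` and `two_phase_rename` are hypothetical names, and nothing here implies Gitaly offers such RPCs. A real protocol would also need durable state so that a coordinator crash between the phases can be recovered.

```ruby
# Hypothetical participant standing in for one Gitaly shard.
class Participant
  attr_reader :name, :namespaces

  def initialize(name, namespaces)
    @name = name
    @namespaces = namespaces.dup
    @pending = nil
  end

  # Phase 1: vote yes only if the rename could succeed, and record the intent.
  def prepare(from, to)
    return false unless @namespaces.include?(from) && !@namespaces.include?(to)

    @pending = [from, to]
    true
  end

  # Phase 2 (all voted yes): apply the prepared rename.
  def commit
    from, to = @pending
    @namespaces[@namespaces.index(from)] = to
    @pending = nil
  end

  # Phase 2 (someone voted no): forget the intent; paths were never touched.
  def rollback
    @pending = nil
  end
end

# Coordinator: commits everywhere only if every participant prepared.
def two_phase_rename(participants, from, to)
  prepared = participants.take_while { |p| p.prepare(from, to) }

  if prepared.size == participants.size
    participants.each(&:commit)
    true
  else
    prepared.each(&:rollback)
    false
  end
end

shards = [Participant.new('default', ['group']), Participant.new('broken', ['group', 'foo'])]
renamed = two_phase_rename(shards, 'group', 'foo')
```

In the reproduction scenario, broken votes no during prepare, so default is never renamed and no reversal is needed. The sketch does not cover a failure during the commit step itself, which is part of why this option is harder to get right.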

The problematic code is here: https://gitlab.com/gitlab-org/gitlab-ce/blob/master/app/models/concerns/storage/legacy_namespace.rb#L52

What we can't do is call rm_namespace on the destination before doing the rename. That could lead to data loss.

I expect this explains where some subset of the missing repositories seen on the primary in the GCP Migration have gone. /cc @stanhu

Edited Jul 31, 2018 by Nick Thomas