Group path rename may fail partway through, leaving repositories on disk in an inconsistent state
Summary
When renaming a namespace that has legacy-storage projects spread across several Gitaly shards, if any RenameNamespace RPC other than the first fails, the database and the Gitaly shards on which the rename already succeeded will be left out of sync.
Steps to reproduce
- Create a group hierarchy like this, using projects with legacy (i.e., not hashed) storage:

  group/
    project1   # repository_storage: default
    project2   # repository_storage: broken

  A simple way to "break" Gitaly for the second project is to create the broken/repositories/foo directory. This will cause the RenameNamespace RPC to fail.
- Rename group to foo.

The rename will fail, leaving the group in the database with the name group. The user will see an error saying the group could not be renamed.
What is the current bug behavior?
The repository for project2 will stay in its usual place (broken/repositories/group/project2.git).
However, assuming project1 was attempted (and succeeded) before project2, the repository for project1 will now be at default/repositories/foo/project1.git, which is the wrong place for it. Attempting to access /group/project1 in the web UI will fail. Over time, an empty repository may be created for it.
What is the expected correct behavior?
The group rename should be fully transactional. It's not OK to lose repositories if the rename fails partway through.
Output of checks
This bug happens on GitLab.com. Infrastructure issue investigating a particular instance: https://gitlab.com/gitlab-com/infrastructure/issues/4630#note_91287462 (confidential as it contains user details)
Possible fixes
Hashed storage resolves this problem, as there is no need to move the project repositories around at all.
A quick and dirty mitigation would be to track the renames that succeeded, and attempt to reverse them all if any one rename fails. This isn't perfect, but should solve most of the cases. If the Unicorn process dies partway through, or a previously-working Gitaly breaks on us, we could still be left in an inconsistent state.
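To make the track-and-reverse idea concrete, here is a minimal sketch. It is not the code in legacy_namespace.rb: FakeShard, rename_namespace and rename_with_rollback are hypothetical names, and the shard is an in-memory stand-in rather than a real Gitaly client.

```ruby
# Hypothetical in-memory stand-in for a Gitaly shard; not the real Gitaly client API.
class FakeShard
  attr_reader :name, :namespaces

  def initialize(name, namespaces, fail_on_rename: false)
    @name = name
    @namespaces = namespaces.dup
    @fail_on_rename = fail_on_rename
  end

  # Stands in for the RenameNamespace RPC.
  def rename_namespace(from, to)
    raise "rename failed on #{name}" if @fail_on_rename
    raise "#{from} not found on #{name}" unless @namespaces.delete(from)

    @namespaces << to
  end
end

# Renames the namespace on every shard, remembering which shards succeeded so
# that those renames can be reversed (most recent first) if a later shard fails.
def rename_with_rollback(shards, from, to)
  completed = []
  shards.each do |shard|
    shard.rename_namespace(from, to)
    completed << shard
  end
  true
rescue StandardError
  completed.reverse_each do |shard|
    begin
      shard.rename_namespace(to, from)
    rescue StandardError
      # Best effort only: if the reversal itself fails, this shard stays inconsistent.
    end
  end
  false
end
```

With a working "default" shard and a failing "broken" shard, rename_with_rollback returns false and "default" ends up holding the original namespace again. As noted above, this is best effort only: a process crash between the rename and the reversal is not covered.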
A more robust mitigation would be to implement some form of 2- or 3-phase commit protocol. This is a lot more difficult, but would allow us to prepare for the rename on all Gitaly shards, and be assured of fallback to the correct filesystem paths if that rename failed on any one of them.
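A rough illustration of the two-phase shape follows, under heavy assumptions. PreparableShard, prepare_rename, commit_rename, abort_rename and two_phase_rename are all hypothetical; Gitaly is not known to expose such RPCs. A real protocol would also need durable state on each shard, timeouts, and recovery when the coordinator dies between phases.

```ruby
# Hypothetical shard that supports a prepare/commit/abort cycle.
# In-memory only; a real implementation would persist the pending rename.
class PreparableShard
  attr_reader :name, :namespaces

  def initialize(name, namespaces)
    @name = name
    @namespaces = namespaces.dup
    @pending = nil
  end

  # Phase 1: check the rename can succeed and record it, without moving anything.
  def prepare_rename(from, to)
    return false unless @namespaces.include?(from)
    return false if @namespaces.include?(to) # destination already exists

    @pending = [from, to]
    true
  end

  # Phase 2 (commit): apply the rename recorded during prepare.
  def commit_rename
    return unless @pending

    from, to = @pending
    @namespaces.delete(from)
    @namespaces << to
    @pending = nil
  end

  # Phase 2 (abort): discard the recorded rename; nothing was moved.
  def abort_rename
    @pending = nil
  end
end

# Coordinator: commits only if every shard prepared successfully.
def two_phase_rename(shards, from, to)
  prepared = []
  all_ready = shards.all? do |shard|
    ok = shard.prepare_rename(from, to)
    prepared << shard if ok
    ok
  end

  if all_ready
    shards.each(&:commit_rename)
    true
  else
    prepared.each(&:abort_rename)
    false
  end
end
```

In the reproduction scenario, the "broken" shard already contains foo, so its prepare step refuses and no shard moves anything. The sketch assumes that a commit following a successful prepare cannot fail, which is the hard part in practice.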
The problematic code is here: https://gitlab.com/gitlab-org/gitlab-ce/blob/master/app/models/concerns/storage/legacy_namespace.rb#L52
What we can't do is call rm_namespace on the destination before doing the rename. That could lead to data loss.
I expect this explains where some subset of the missing repositories seen on the primary in the GCP Migration have gone. /cc @stanhu