The simplest way to do this is to use a UUID for the folder name instead of the current namespace/name.
There are multiple benefits to doing this, but I'll mention just two that I think are major:
Security: to identify a repo that you want to pull off the disk, you would also need to compromise the database. This makes it harder to extract data from GitLab in case of a security breach.
Reusing group and project names after deletion stops being a problem - it's virtually impossible to get the same UUID.
This shouldn't be much of a change to the application. I think the main issue will be moving all the projects around in the filesystem (a simple mv command would suffice), simply because it will take a long time.
For bonus points, we could use the same technique as git and use the first 2 characters of the UUID as a first-level folder to group them. That avoids mistakes like hitting double tab as root on the host and taking it down on bash expansion.
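A minimal sketch of that layout, assuming the rest of the UUID becomes the folder name the way git splits object hashes. The helper name `uuid_repo_path` and the storage directory are hypothetical, not anything GitLab defines:

```ruby
require "securerandom"

# Hypothetical helper: builds a repository path from a UUID, using the
# first two characters as a first-level folder and the remainder as the
# repository folder name (an assumption borrowed from git's object layout).
def uuid_repo_path(storage_dir, uuid = SecureRandom.uuid)
  File.join(storage_dir, uuid[0, 2], "#{uuid[2..-1]}.git")
end

puts uuid_repo_path("/srv/repos", "3f2a9c1e-8b7d-4e6f-a1b2-c3d4e5f60718")
```

With hexadecimal prefixes this gives at most 256 first-level folders, so no single directory has to list every repository.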
This is a great idea. We'll clearly need to support both paths while migrating, which will be the biggest pain point. Although it's a filesystem layout change, not a database change, a lot of the comments in https://gitlab.com/gitlab-org/gitlab-ce/issues/26130 would apply here too.
Yes, it will take a while, but we can just fill the path with what makes sense now and then perform the migration on the backend while GitLab.com is running, the same way we migrate repos from one shard to another. This can be treated the same way (except that we should use mv instead of rsync to avoid duplicating the needed storage).
I had an idea that I thought was reasonable, but it won't work. I'm putting it here for the benefit of others, so that they don't make the same mistake. My idea was to do a 'slow move', like we'd do in the database:
In release A, add hard links from the existing storage paths to the new path format, on the same mount.
In release B, use the new path format.
In release C, remove the links from the old formats, so that only the new formats are available.
We'll probably just need a fallback while we migrate, to check first the new path, and then the old path if that doesn't exist, like in my previous comment.
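The fallback could look roughly like this. It is only a sketch: `resolve_repo_path` is a hypothetical name, and both candidate paths are assumed to be computed elsewhere:

```ruby
# Hypothetical lookup used while both layouts coexist: prefer the new
# path, fall back to the legacy one, and return nil if neither exists.
def resolve_repo_path(new_path, legacy_path)
  return new_path if File.directory?(new_path)
  return legacy_path if File.directory?(legacy_path)

  nil
end
```

The cost is one extra filesystem check for every project that has not been moved yet, which is why the fallback would only be kept for the duration of the migration.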
I'm thinking that we could do this not as a migration but as a maintenance task: we ship the rename, and every time we move a project across shards we use a new folder. This way we would just be moving things from one shard to the next, detaching them from the current repo name using the same behavior we have right now.
But the path can be there unchanged forever. Even when it's the old one.
And deleting and creating again will have the exact same effect that we're
looking for. There is no need to force people to migrate everything at all.
@pcarranza that's an interesting idea. At the moment we don't store the project's full path in the DB, so we'd have to start doing that - or at least support both schemes for a while.
If we did store the project's full path in the DB (probably by storing its storage and path within that storage separately), then we could do the move as you described, and default to the new scheme for new projects. If a project doesn't have its path in the database, then we assume it uses the old scheme.
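As a rough illustration of that rule, here is a sketch with a plain `Struct` standing in for the project record. The `disk_path` column name and the `relative_repo_path` helper are assumptions made for the example, not existing schema:

```ruby
# Stand-in for the projects table row; disk_path is a hypothetical
# column that is nil for projects still on the old scheme.
Project = Struct.new(:namespace_path, :name, :disk_path)

# Returns the path within the storage: the stored one when present,
# otherwise the legacy namespace/name-based path.
def relative_repo_path(project)
  return project.disk_path if project.disk_path

  File.join(project.namespace_path, "#{project.name}.git")
end
```

Old records need no backfill under this rule, because a missing value is itself the marker for the old scheme.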
Yeah. What I mean is that by doing it that way we don't need a migration
and it's completely optional to change the old records. Yet we have the
benefits right away.
I think something along these lines is reasonable:
If we did store the project's full path in the DB (probably by storing its storage and path within that storage separately), then we could do the move as you described, and default to the new scheme for new projects. If a project doesn't have its path in the database, then we assume it uses the old scheme.
Scheduling-wise, not critical for production, but really neat to have, particularly from the support perspective of people deleting and then trying to create a new repo with the same name. So I would gauge interest with @lbot here.
@lbot the result of this is that you will no longer see a user complaining: "I deleted the project to create another one with the same name, and I can't."
+1 for the UUID approach. Consider having UUID per namespace, and also a new table/model in the DB that actually tracks UUID issuance as a service which proactively prevents re-use in the event of an unlikely collision.
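One way to picture the issuance service is the sketch below. An in-memory `Set` stands in for the proposed table (which would presumably carry a unique index), and the `UuidIssuer` class is hypothetical:

```ruby
require "securerandom"
require "set"

# Hypothetical issuance tracker: remembers every UUID handed out and
# retries on the (unlikely) event of a collision.
class UuidIssuer
  def initialize(generator = -> { SecureRandom.uuid })
    @issued = Set.new
    @generator = generator
  end

  def issue
    loop do
      candidate = @generator.call
      # Set#add? returns nil when the value was already present.
      return candidate if @issued.add?(candidate)
    end
  end
end
```

The generator is injectable only so that a collision can be simulated; in real use the default random UUID generator would be expected to succeed on the first attempt.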
Longer term, the UUID approach is also good for storage sharding across globally distributed nodes using a DHT mapping.
Geo has a strong interest in having this, since this would make project deletion and renaming a lot easier.
Right now there is an assumption that the path of the Wiki repo has the suffix .wiki.git. We would have to store the path in the database and remove this assumption.
@brodock brings up the point about how this would work for Geo: can we make the secondary use this new scheme while the primary uses the original one? Do we want this for GitLab.com, so that we don't have to migrate all projects on the primary before syncing them on the secondary?
The issue here is that if we store the full path in the database, that means either:
The primary must migrate all the repositories to the new scheme
The secondary must keep track of its own path somewhere else
I think there is too much room for error, confusion, and inconsistency by having the secondary use different path names than the primary. I think we should make the primary and secondary use the same pathname.
@stanhu the proposal to keep primary and secondary in different "path formats" is just to provide an easier migration path. The intention is that new installations will default to the new format, while existing ones will be able to switch at their own pace (except on the secondary node, where the new format will be a requirement so that we can make the sync reliable).
The path is defined to be #{storage_dir}/#{uuid[0..1]}/#{uuid[2..3]}/#{uuid[4..-1]}.git
We store a version number in the projects table to indicate the path type (e.g. nil, 0, 1 are legacy paths, 2 is the UUID format). When someone calls Repository#full_path, we return the full path based on the version.
If we migrate a project to a new version, we move the path to the right location.
If someone migrates the primary to the new pathname, Geo secondaries won't have the new path and will just redownload the repository.
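The version-based lookup in this proposal could be sketched as follows. This is an illustration under stated assumptions, not the actual `Repository#full_path`: the free function `repo_full_path`, its keyword arguments, and the shape of the legacy path are all hypothetical:

```ruby
# Hypothetical version-aware path lookup. Versions nil, 0 and 1 are
# treated as legacy namespace/name paths; version 2 is the UUID format
# #{storage_dir}/#{uuid[0..1]}/#{uuid[2..3]}/#{uuid[4..-1]}.git
def repo_full_path(storage_dir, version:, legacy_path:, uuid: nil)
  case version
  when nil, 0, 1
    File.join(storage_dir, "#{legacy_path}.git")
  when 2
    File.join(storage_dir, uuid[0..1], uuid[2..3], "#{uuid[4..-1]}.git")
  else
    raise ArgumentError, "unknown path version: #{version.inspect}"
  end
end
```

Raising on an unknown version is a design choice for the sketch: it makes a half-applied migration fail loudly instead of silently resolving to the wrong directory.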
I think we coded the original move of a project with rsync because there we really are moving from one shard to another.
In this case we could maybe use a plain mv (which would make it immediate), but I'm not sure how easy that would be if we are moving a repo from a nested group into a new flat path. I'm open to being corrected.
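For what it's worth, moving from a nested legacy path to a new one mostly comes down to creating the target's parent directories first. A minimal sketch with Ruby's standard `FileUtils` (the `move_repo` helper is hypothetical):

```ruby
require "fileutils"

# Hypothetical helper: moves a repository directory to its new location.
# FileUtils.mv renames in place when source and target are on the same
# filesystem and falls back to copy-and-remove when they are not.
def move_repo(old_path, new_path)
  FileUtils.mkdir_p(File.dirname(new_path))
  FileUtils.mv(old_path, new_path)
end
```

The sketch leaves the now-empty legacy group directories behind; whether to clean those up is a separate decision.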
cc/ @eReGeBe because of coding the original shard-move feature
I'm also wondering whether we need to be careful about throwing every repo into an opaque namespace. For example, https://gitlab.com/gitlab-org/gitlab-ce/issues/28112#proposal-1 proposes that we include the project ID in the path. This might be a good idea: it would make it easier for admins to back up and find the appropriate repositories, and it would give a clean separation between different projects, which may be important for segregating data.
I think that the way I would look for a repo is first by going to the DB to get the path (which we can do with ChatOps today) and then would go down to the filesystem.
In our particular case, we already have hundreds of thousands of repos in the filesystem, which makes it impossible to just walk your way through with cd something-something-<tab>-<tab>, as that will just lead to a locked terminal.
So it doesn't make it any easier to move around; we need to get the correct path first and then access it.
Backup-wise (or rather, recovery-wise), it's the same deal: we would find which shard the repo is on and bring up that shard to pull the data.