The simplest way to do this is to use a UUID for the folder name instead of the current namespace/name.
There are multiple benefits to doing this, but I'll just mention two that I think are major:
Security: in order to identify a repo that you may want to pull off the disk, you also need to compromise the database. This makes it harder to extract data from GitLab in case of a security breach.
Reusing group and project names after deletion stops being a problem - it's virtually impossible to get the same UUID.
This shouldn't be much of a change to the application. I think the main issue will be migrating all the projects around in the filesystem (a simple `mv` command would suffice), simply because it will take a long time.
For bonus points we could use the same technique git does and use the first 2 chars of the UUID as a first-level folder to group them, avoiding mistakes like double-tabbing as root on the host and taking it down with bash expansion.
This is a great idea. We'll clearly need to support both paths while migrating, which will be the biggest pain point. Although it's a filesystem layout change, not a database change, a lot of the comments in https://gitlab.com/gitlab-org/gitlab-ce/issues/26130 would apply here too.
Yes, it will take a while, but we can just fill the path with what makes sense now and then perform the migration on the backend while GitLab.com is running. It's the same way we migrate repos from one shard to the other, so it can be treated the same (except that we should use `mv` instead of `rsync` to avoid duplicating the needed storage).
I had an idea that I thought was reasonable, but won't work. I'm putting it here for the benefit of others, so that they don't make the same mistake. My idea was to do a 'slow move', like we'd do in the database:
In release A, add hard links from the existing storage paths to the new path format, on the same mount.
In release B, use the new path format.
In release C, remove the links from the old formats, so that only the new formats are available.
We'll probably just need a fallback while we migrate: check the new path first, and then the old path if that doesn't exist, like in my previous comment.
I'm thinking that we could do this not as a migration but as a maintenance task: we ship the rename, and every time we move a project across shards we use a new folder. This way we could just be moving things from one shard to the next, detaching them from the current repo name, using the same behavior we have right now.
But the path can be there unchanged forever, even when it's the old one. And deleting and creating again will have the exact same effect that we're looking for. There is no need to force people to migrate everything at all.
@pcarranza that's an interesting idea. At the moment we don't store the project's full path in the DB, so we'd have to start doing that - or at least support both schemes for a while.
If we did store the project's full path in the DB (probably by storing its storage and path within that storage separately), then we could do the move as you described, and default to the new scheme for new projects. If a project doesn't have its path in the database, then we assume it uses the old scheme.
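For illustration, a minimal sketch of that fallback, assuming a hypothetical `stored_disk_path` value persisted in the DB (not actual GitLab code):

```ruby
# Prefer a path stored in the database; fall back to the legacy
# namespace/name layout when nothing is stored for the project.
def repository_disk_path(storage_dir, stored_disk_path, namespace_full_path)
  relative = stored_disk_path || namespace_full_path
  File.join(storage_dir, "#{relative}.git")
end

repository_disk_path('/var/opt/gitlab/git-data/repositories', nil, 'gitlab-org/gitlab-ce')
# => "/var/opt/gitlab/git-data/repositories/gitlab-org/gitlab-ce.git"
```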
Yeah. What I mean is that by doing it that way we don't need a migration, and it's completely optional to change the old records. Yet we have the benefits right away.
I think something along these lines is reasonable:
> If we did store the project's full path in the DB (probably by storing its storage and path within that storage separately), then we could do the move as you described, and default to the new scheme for new projects. If a project doesn't have its path in the database, then we assume it uses the old scheme.
Scheduling-wise, not critical for production, but really neat to have, particularly from the support perspective of people deleting and then trying to create a new repo with the same name. So I would gauge interest with @lbot here.
@lbot the result of this is that you will not see users complaining about: "I deleted the project to create another one with the same name and I can't."
+1 for the UUID approach. Consider having a UUID per namespace, and also a new table/model in the DB that tracks UUID issuance as a service, proactively preventing re-use in the event of an unlikely collision.
Longer term, the UUID approach is also good for storage sharding across globally distributed nodes using a DHT mapping.
Geo has a strong interest in having this, since this would make project deletion and renaming a lot easier.
Right now there is an assumption that the path of the Wiki repo has the suffix .wiki.git. We would have to store the path in the database and remove this assumption.
@brodock brings up the point about how this would work for Geo: can we make the secondary use this new scheme while the primary uses the original one? Do we want this for GitLab.com, so that we don't have to migrate all projects on the primary before syncing them on the secondary?
The issue here is that if we store the full path in the database, that means either:
The primary must migrate all the repositories to the new scheme
The secondary must keep track of its own path somewhere else
I think there is too much room for error, confusion, and inconsistency by having the secondary use different path names than the primary. I think we should make the primary and secondary use the same pathname.
@stanhu the proposal to keep primary and secondary in different "path formats" is just to provide an easier migration path. The intention is that new installations will default to the new format, while existing ones will be able to migrate at their own pace (except for the secondary node, where it will be a requirement, so we can make the sync reliable).
The path is defined to be `#{storage_dir}/#{uuid[0..1]}/#{uuid[2..3]}/#{uuid[4..-1]}.git`
We store a version number in the projects table to indicate the path type (e.g. nil, 0, 1 are legacy paths, 2 is the UUID format). When someone calls Repository#full_path, we return the full path based on the version.
If we migrate a project to a new version, we move the path to the right location.
If someone migrates the primary to the new pathname, Geo secondaries won't have the new path and will just redownload the repository.
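A rough sketch of that versioned lookup (the `storage_version` and `uuid` fields and the constant are invented names for illustration, not existing code):

```ruby
# nil/0/1 mean the legacy namespace/name layout, 2 means the UUID layout above.
UUID_STORAGE_VERSION = 2

def full_path(storage_dir, project)
  if project[:storage_version].to_i >= UUID_STORAGE_VERSION
    uuid = project[:uuid]
    File.join(storage_dir, uuid[0..1], uuid[2..3], "#{uuid[4..-1]}.git")
  else
    File.join(storage_dir, "#{project[:full_path]}.git")
  end
end

project = { storage_version: 2, uuid: 'd7ed30df-ef2d-4166-9102-1eca4f201f1c' }
full_path('/storage', project)
# => "/storage/d7/ed/30df-ef2d-4166-9102-1eca4f201f1c.git"
```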
I think the way we coded the original move of a project was with `rsync`; this is because we are indeed moving from one shard to another.
Maybe in this case we could use a plain `mv` (which would make it immediate), but I'm not sure how easy that would be if we are moving a repo from a nested group into a new flat path - but I'm open to being corrected.
cc/ @eReGeBe because of coding the original shard-move feature
I'm also wondering whether we need to be careful about throwing every repo into an opaque namespace. For example, https://gitlab.com/gitlab-org/gitlab-ce/issues/28112#proposal-1 proposes that we include the project ID in the path. This might be a good idea to make it easier for admins to back up/find the appropriate repositories and to have a clean separation between different projects, which may be important for segregating data.
I think that the way I would look for a repo is first to go to the DB to get the path (which we can do with ChatOps today) and then go down to the filesystem.
In our particular case, we already have hundreds of thousands of repos in the filesystem, which makes it impossible to just walk your way through `cd something-something-<tab>-<tab>` as it will just lead to a locked terminal.
So it doesn't make it easier to move around at all; we need to get the correct path first and then access it.
Backup-wise (or better said, recovery-wise), same deal: we would find which shard the repo is in and bring up that shard to pull the data.
I've run some graphs for 1 million repositories (UUIDs) to see the distribution across the first 2 characters:
And considering a 4-character distribution for an `ab/cd` subfolder format (`#{uuid[0..1]}/#{uuid[2..3]}`):
The question is: how many repositories do we expect to support in this format on a single machine?
I know that for GitLab.com we are going the route of sharding into multiple "storage" machines, but what about customers running it with a storage appliance? Does it make sense to consider 1M repositories on a single "file-system"?
Explanation: the PostgreSQL UUID type stores data as "bytes" of 128-bit size. It has a standard format regardless of the algorithm used to generate it. We are going to use Ruby's `SecureRandom.uuid`, which generates compatible data in the format: `8f4cd249-f0fb-40a0-9fc5-0397a884deab`
To run the experiments above I've used this code:
```ruby
require 'securerandom'
require 'gnuplotrb' # requires gnuplot executable to be available

def simulate(samples, title, &block)
  distribution = samples.times.map { block.call }.group_by(&:itself).map { |k, v| [k, v.length] }.sort.to_h
  datapoints = [distribution.keys.map { |k| k.to_i(16) }, distribution.values]

  ds = GnuplotRB::Dataset.new(datapoints, with: 'boxes', title: title)
  plot = GnuplotRB::Plot.new(ds,
                             title: "#{title} for #{samples} samples",
                             xrange: 0..distribution.keys.count,
                             xlabel: 'digits (decimal)',
                             ylabel: 'n times',
                             style: 'fill solid 1.00')

  plot.to_png("/tmp/uuid-#{samples}.png")
end

simulate 1000000, 'UUID[0..1]' do
  SecureRandom.uuid[0..1]
end

# simulate 1000000, 'UUID[0..3]' do
#   SecureRandom.uuid[0..3]
# end
```
How does the container registry store things today? Do we use the "filesystem" driver of the StorageAPI (https://docs.docker.com/registry/storage-drivers/)? If so, how do we handle renames today?
I assume it has the same issues projects do today, and we need the same approach for Geo, which means custom URL routing that is different from where it is stored on disk.
Same as the question above: we need to apply the same behavior to Pages. When we move from the "legacy" format to the new one, we also need to update Pages to use the UUID approach for the root folder; otherwise we have the same issue of needing to synchronize the database with the file-system.
```ruby
Project.first.pages_path
=> "/.../gitlab/shared/pages/h5bp/html5-boilerplate"
```

will become something like (for project UUID `d7ed30df-ef2d-4166-9102-1eca4f201f1c`):

```ruby
=> "/.../gitlab/shared/pages/d7/ed/d7ed30df-ef2d-4166-9102-1eca4f201f1c"
```
@nick.thomas can probably answer this in more detail, but I'd imagine we'll have to update the config JSON to include the mapping of domain names to pathnames based on this UUID scheme. I hope this helps improve subgroup support too.
We've recently had several issues discussing how to re-organise the pages filesystem data. As long as we have some way of going from domain + path prefix -> UUID, this will be fine.
However, it doesn't really help with subgroup support. The problems there are to do with namespace collisions - some of which already exist, given namespaces and paths can have the . character and get embedded into domain names (e.g. nick.thomas.gitlab.io !)
For anything that we store on disk, we will need metadata to be stored externally. So the path that will be used is not really that important, as we can slowly transition from the project path to a UUID-based path if needed.
@nick.thomas for the namespace collision, there is only "one way" to fix this and have subgroups: for any existing namespaces/paths with `.`, change it to `-` or `_` when making the URL slug, and use `.` only for the subgrouping.
What are we going to do with backups and the new storage format? Should we version backups somehow (I don't know if they already have a "backup version" manifest), and should we start generating backup tarballs in the new format (the one that uses only UUIDs)?
This is an exploratory question; we should probably open an issue after getting some input.
IMHO it also makes sense to keep backups in the new format so there is a bigger chance backups will be "durable" for longer. As of today, a backup of a namespace/repository from a year ago may not even be the same "project", as things can have been moved around; using UUIDs, you always have the backup for the right thing.
Yeah, it's a good question. We probably should consider exporting a CSV file with project names mapping to the pathnames and including that in the tarball.
Sounds like UUID[0..3] will be sufficient for a while.
@brodock could we split into 2 folders, like `UUID[0..1]/UUID[2..3]`?
This is because using the first 4 chars as a single level would end up with 65k folders at the top level.
This can end up being a performance hog at the FS level, and it makes it big enough to be a trap if we are trying to manually walk the path.
Besides that, ext3 only supports up to 32k folders inside a folder, so this would leave us in the same bad position as before (not that I'm advocating for using ext3, but you know, customers...)
Other than that, I agree with the proposed distribution.
@brodock asked me about how to deal with UUID data types in PostgreSQL/MySQL so I'll weigh in a bit.
Why are we proposing the use of UUIDs instead of just regular IDs? Using IDs we have the same benefits as UUIDs (= not using project names as directory names and such), but we don't have to deal with things such as generating the UUIDs and storing them in an efficient way that is supported for both MySQL and PostgreSQL. Further, renaming projects and such is supported out of the box since the IDs don't change.
Looking through the discussion there seems to be a lot of complexity involved in getting unique UUID fragments and what not, when IDs would solve all these problems as far as I can tell. Unless I missed something it feels like we're over-engineering things.
> Besides that, ext3 only supports up to 32k folders inside a folder, so this would leave us in the same bad position as before (not that I'm advocating for using ext3, but you know, customers...)
IIRC ext3 is (being) removed from the Linux kernel drivers list, so I'd argue we should just not support ext3 at all (or similar file systems with limits on the number of directories/files).
Using the integer ID approach we can still spread directories around, using the same technique as UUIDs. Since IDs are incremental this should lead to a fairly consistent distribution depending on the prefix length.
> Using the integer ID approach we can still spread directories around, using the same technique as UUIDs. Since IDs are incremental this should lead to a fairly consistent distribution depending on the prefix length.
I thought the UUID scheme was actually simpler for two reasons:
We don't have to ensure a monotonically increasing number (and avoid accidental collisions) by creating the project entry in the DB before we create the directory.
Sharding is a simple matter of dividing up the names, which LFS and git do already. There was a scheme proposed in https://gitlab.com/gitlab-org/gitlab-ce/issues/28112 that might work, but sharding seems a bit more complicated there.
With an integer, perhaps not use the "prefix", but rather some sort of modulo function (see https://en.wikipedia.org/wiki/Benford%27s_law for why about 30% of leading digits are "1").
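For illustration, a quick sketch of how a modulo-based fan-out differs from a prefix of an incremental ID (function names made up here):

```ruby
# A prefix of an incremental ID clusters: consecutive IDs 100000..109999
# all land in the "10" directory.
def prefix_dir(project_id)
  project_id.to_s.rjust(4, '0')[0..1]
end

# A modulo over the ID spreads projects evenly across a fixed set of buckets.
def modulo_dir(project_id, buckets = 256)
  format('%02x', project_id % buckets) # two hex chars in 00..ff
end
```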
> We don't have to ensure a monotonically increasing number (and avoid accidental collisions) by creating the project entry in the DB before we create the directory.
Perhaps I'm missing something, but what is the point of creating repository/file system data before creating a row in the database? I'm specifically referring to the re-use of the primary key ID, not something like our iid values. This means we don't need to perform any additional work, it's already there.
> Perhaps I'm missing something, but what is the point of creating repository/file system data before creating a row in the database?
Please correct me if I'm wrong, but if we look at this code:
```ruby
def save_project_and_import_data(import_data)
  Project.transaction do
    @project.create_or_update_import_data(data: import_data[:data], credentials: import_data[:credentials]) if import_data

    if @project.save && !@project.import?
      raise 'Failed to create repository' unless @project.create_repository
    end
  end
end
```
I can see a case where the create_repository occurs before the transaction is committed to the database and the transaction is rolled back. It looks like PostgreSQL auto increments the primary ID even if the transaction is not committed (https://www.postgresql.org/message-id/501B1494.9040502@ringerc.id.au), so perhaps this is not an issue with PostgreSQL. I think this is true for MySQL as well.
So I suppose it's not a real issue. But the sharding question still remains. Do we want a balanced directory tree? How would we do that with the primary ID approach?
IMHO, to balance with the primary ID we would have to use a hash function, which we agree not to do.
I've searched a little bit, and it looks like internally Postgres stores the UUID as "bytes" and converts back and forth to the representation format. We could emulate the same for MySQL, converting to the binary form when storing and back to the representation format when loading. This reduces the storage significantly compared with storing it as char/varchar.
So it can be stored as a `binary(16)` instead of a `varchar(36)`.
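To illustrate, the 36-character text form can be packed to and from 16 raw bytes in plain Ruby (a sketch of the idea, independent of any particular gem):

```ruby
require 'securerandom'

uuid = SecureRandom.uuid        # e.g. "8f4cd249-f0fb-40a0-9fc5-0397a884deab" (36 chars)

# Pack the 32 hex digits into 16 raw bytes, suitable for a binary(16) column.
packed = [uuid.delete('-')].pack('H*')
packed.bytesize                 # => 16

# Unpack back to the canonical dashed form when loading from the database.
hex = packed.unpack('H*').first
restored = [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join('-')
restored == uuid                # => true
```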
I've gone down a rabbit hole trying to make https://github.com/mathieujobin/activerecord-mysql-uuid-column work for Rails 4.x. In the end it seems that there is another one that may work in a similar way, but for Rails 4.x. Rails 5 has support for UUID data types for MySQL, so we don't need any gem for that anymore.
UUIDs are expensive to store, maintain, and query in a database.
We can use the primary ID and compute a hash (no salting) of the value for the same effect.
To avoid potential collisions with the hash function, use a scheme such as `<hash (or a subset of it)>/<project_id>/<data>`. For example: `hash{0..1}/hash{2..3}/project_id/<data>` (see the sketch after the lists below).
Do we want to include the .git extension, or is that unnecessary?
Advantages:
We already store the ID, so it's not another column to add
It retains the same properties as a UUID approach.
We don't have to load extra bytes for every SELECT * FROM projects
Disadvantages:
The directory mapping is computed in a special way. We can't simply export a CSV of the projects table and see what paths are used.
The paths are guessable if you know the project ID.
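For reference, here is a sketch of the scheme from the proposal above (the hash choice and function name are illustrative, not a decided implementation):

```ruby
require 'digest'

# Fan projects out by a hash of the primary ID (SHA-256 here as one possible
# choice, no salting), keeping the ID itself in the path; the repository data
# would then live underneath this directory.
def disk_path(project_id)
  digest = Digest::SHA2.hexdigest(project_id.to_s)
  File.join(digest[0..1], digest[2..3], project_id.to_s)
end

disk_path(42) # => "<h0h1>/<h2h3>/42", where h0..h3 are the first hex digits of the hash
```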