The simplest way to do this is to use a UUID for the folder name instead of the current namespace/name.
There are multiple benefits to doing this, but I'll just mention two that I think are major:
Security: in order to identify a repo that you may want to pull off the disk, you also need to compromise the database. This makes it harder to extract data from GitLab in case of a security breach.
Reusing group and project names after deletion stops being a problem - it's virtually impossible to get the same UUID.
This shouldn't be much of a change to the application. I think the main issue will be migrating all the projects around in the filesystem (a simple `mv` command would suffice), simply because it will take a long time.
For bonus points we could use the same technique git does and use the first 2 chars of the UUID as a first-level folder to group them, avoiding mistakes like double-tabbing as root on the host and taking it down with bash expansion.
This is a great idea. We'll clearly need to support both paths while migrating, which will be the biggest pain point. Although it's a filesystem layout change, not a database change, a lot of the comments in https://gitlab.com/gitlab-org/gitlab-ce/issues/26130 would apply here too.
Yes, it will take a while, but we can just fill the path with what makes sense now and then perform the migration on the backend while GitLab.com is running. It's the same way we migrate repos from one shard to the other, so it can be treated the same (except that we should use `mv` instead of `rsync` to avoid duplicating the needed storage).
I had an idea that I thought was reasonable, but won't work. I'm putting it here for the benefit of others, so that they don't make the same mistake. My idea was to do a 'slow move', like we'd do in the database:
In release A, add hard links from the existing storage paths to the new path format, on the same mount.
In release B, use the new path format.
In release C, remove the links from the old formats, so that only the new formats are available.
We'll probably just need a fallback while we migrate: check the new path first, and then the old path if that doesn't exist, like in my previous comment.
I'm thinking that we could do this not as a migration but as a maintenance task: we ship the rename, and every time we move a project across shards we use a new folder. This way we could just be moving things from one shard to the next, detaching them from the current repo name, using the same behavior we have right now.
But the path can be there unchanged forever, even when it's the old one. And deleting and creating again will have the exact same effect that we're looking for. There is no need to force people to migrate everything at all.
@pcarranza that's an interesting idea. At the moment we don't store the project's full path in the DB, so we'd have to start doing that - or at least support both schemes for a while.
If we did store the project's full path in the DB (probably by storing its storage and path within that storage separately), then we could do the move as you described, and default to the new scheme for new projects. If a project doesn't have its path in the database, then we assume it uses the old scheme.
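For illustration, a minimal sketch of that fallback, assuming a hypothetical `stored_disk_path` value persisted in the DB (not actual GitLab code):

```ruby
# Prefer a path stored in the database; fall back to the legacy
# namespace/name layout when nothing is stored for the project.
def repository_disk_path(storage_dir, stored_disk_path, namespace_full_path)
  relative = stored_disk_path || namespace_full_path
  File.join(storage_dir, "#{relative}.git")
end

repository_disk_path('/var/opt/gitlab/git-data/repositories', nil, 'gitlab-org/gitlab-ce')
# => "/var/opt/gitlab/git-data/repositories/gitlab-org/gitlab-ce.git"
```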
Yeah. What I mean is that by doing it that way we don't need a migration, and it's completely optional to change the old records. Yet we have the benefits right away.
I think something along these lines is reasonable:
> If we did store the project's full path in the DB (probably by storing its storage and path within that storage separately), then we could do the move as you described, and default to the new scheme for new projects. If a project doesn't have its path in the database, then we assume it uses the old scheme.
Scheduling-wise, not critical for production, but really neat to have, particularly from the support perspective of people deleting and then trying to create a new repo with the same name. So I would gauge interest with @lbot here.
@lbot the result of this is that you will not see users complaining about: "I deleted the project to create another one with the same name and I can't."
+1 for the UUID approach. Consider having a UUID per namespace, and also a new table/model in the DB that tracks UUID issuance as a service, proactively preventing re-use in the event of an unlikely collision.
Longer term, the UUID approach is also good for storage sharding across globally distributed nodes using a DHT mapping.
Geo has a strong interest in having this, since this would make project deletion and renaming a lot easier.
Right now there is an assumption that the path of the Wiki repo has the suffix .wiki.git. We would have to store the path in the database and remove this assumption.
@brodock brings up the point about how this would work for Geo: can we make the secondary use this new scheme while the primary uses the original one? Do we want this for GitLab.com, so that we don't have to migrate all projects on the primary before syncing them on the secondary?
The issue here is that if we store the full path in the database, that means either:
The primary must migrate all the repositories to the new scheme
The secondary must keep track of its own path somewhere else
I think there is too much room for error, confusion, and inconsistency by having the secondary use different path names than the primary. I think we should make the primary and secondary use the same pathname.
@stanhu the proposal to keep primary and secondary in different "path formats" is just to provide an easier migration path. The intention is that new installations will default to the new format, while existing ones will be able to migrate at their own pace (except for the secondary node, where it will be a requirement, so we can make the sync reliable).
The path is defined to be `#{storage_dir}/#{uuid[0..1]}/#{uuid[2..3]}/#{uuid[4..-1]}.git`
We store a version number in the projects table to indicate the path type (e.g. nil, 0, 1 are legacy paths, 2 is the UUID format). When someone calls Repository#full_path, we return the full path based on the version.
If we migrate a project to a new version, we move the path to the right location.
If someone migrates the primary to the new pathname, Geo secondaries won't have the new path and will just redownload the repository.
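A rough sketch of that versioned lookup (the `storage_version` and `uuid` fields and the constant are invented names for illustration, not existing code):

```ruby
# nil/0/1 mean the legacy namespace/name layout, 2 means the UUID layout above.
UUID_STORAGE_VERSION = 2

def full_path(storage_dir, project)
  if project[:storage_version].to_i >= UUID_STORAGE_VERSION
    uuid = project[:uuid]
    File.join(storage_dir, uuid[0..1], uuid[2..3], "#{uuid[4..-1]}.git")
  else
    File.join(storage_dir, "#{project[:full_path]}.git")
  end
end

project = { storage_version: 2, uuid: 'd7ed30df-ef2d-4166-9102-1eca4f201f1c' }
full_path('/storage', project)
# => "/storage/d7/ed/30df-ef2d-4166-9102-1eca4f201f1c.git"
```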
I think the way we coded the original move of a project was with `rsync`; this is because we are indeed moving from one shard to another.
Maybe in this case we could use a plain `mv` (which would make it immediate), but I'm not sure how easy that would be if we are moving a repo from a nested group into a new flat path - but I'm open to being corrected.
cc/ @eReGeBe because of coding the original shard-move feature
I'm also wondering whether we need to be careful about throwing every repo into an opaque namespace. For example, https://gitlab.com/gitlab-org/gitlab-ce/issues/28112#proposal-1 proposes that we include the project ID in the path. This might be a good idea to make it easier for admins to back up/find the appropriate repositories and to have a clean separation between different projects, which may be important for segregating data.
I think that the way I would look for a repo is first to go to the DB to get the path (which we can do with ChatOps today) and then go down to the filesystem.
In our particular case, we already have hundreds of thousands of repos in the filesystem, which makes it impossible to just walk your way through `cd something-something-<tab>-<tab>` as it will just lead to a locked terminal.
So it doesn't make it easier to move around at all; we need to get the correct path first and then access it.
Backup-wise (or better said, recovery-wise), same deal: we would find which shard the repo is in and bring up that shard to pull the data.
I've run some graphs for 1 million repositories (UUIDs) to see the distribution across the first 2 characters:
And considering a 4-character distribution for an `ab/cd` subfolder format (`#{uuid[0..1]}/#{uuid[2..3]}`):
The question is: how many repositories do we expect to support in this format on a single machine?
I know that for GitLab.com we are going the route of sharding into multiple "storage" machines, but what about customers running it with a storage appliance? Does it make sense to consider 1M repositories on a single "file-system"?
Explanation: the PostgreSQL UUID type stores data as "bytes" of 128-bit size. It has a standard format regardless of the algorithm used to generate it. We are going to use Ruby's `SecureRandom.uuid`, which generates compatible data in the format: `8f4cd249-f0fb-40a0-9fc5-0397a884deab`
To run the experiments above I've used this code:
```ruby
require 'securerandom'
require 'gnuplotrb' # requires gnuplot executable to be available

def simulate(samples, title, &block)
  distribution = samples.times.map { block.call }.group_by(&:itself).map { |k, v| [k, v.length] }.sort.to_h
  datapoints = [distribution.keys.map { |k| k.to_i(16) }, distribution.values]

  ds = GnuplotRB::Dataset.new(datapoints, with: 'boxes', title: title)
  plot = GnuplotRB::Plot.new(ds,
                             title: "#{title} for #{samples} samples",
                             xrange: 0..distribution.keys.count,
                             xlabel: 'digits (decimal)',
                             ylabel: 'n times',
                             style: 'fill solid 1.00')

  plot.to_png("/tmp/uuid-#{samples}.png")
end

simulate 1000000, 'UUID[0..1]' do
  SecureRandom.uuid[0..1]
end

# simulate 1000000, 'UUID[0..3]' do
#   SecureRandom.uuid[0..3]
# end
```
How does the container registry store things today? Do we use the "filesystem" driver of the StorageAPI (https://docs.docker.com/registry/storage-drivers/)? If so, how do we handle renames today?
I assume it has the same issues projects do today, and we need the same approach for Geo, which means custom URL routing that is different from where it is stored on disk.
Same as the question above: we need to apply the same behavior to Pages. When we move from the "legacy" format to the new one, we also need to update Pages to use the UUID approach for the root folder; otherwise we have the same issue of needing to synchronize the database with the file-system.
```ruby
Project.first.pages_path
=> "/.../gitlab/shared/pages/h5bp/html5-boilerplate"
```

will become something like (for project UUID `d7ed30df-ef2d-4166-9102-1eca4f201f1c`):

```ruby
=> "/.../gitlab/shared/pages/d7/ed/d7ed30df-ef2d-4166-9102-1eca4f201f1c"
```
@nick.thomas can probably answer this in more detail, but I'd imagine we'll have to update the config JSON to include the mapping of domain names to pathnames based on this UUID scheme. I hope this helps improve subgroup support too.
We've recently had several issues discussing how to re-organise the pages filesystem data. As long as we have some way of going from domain + path prefix -> UUID, this will be fine.
However, it doesn't really help with subgroup support. The problems there are to do with namespace collisions - some of which already exist, given namespaces and paths can have the . character and get embedded into domain names (e.g. nick.thomas.gitlab.io !)
For anything that we store on disk, we will need metadata to be stored externally. So the path that will be used is not really that important, as we can slowly transition from the project path to a UUID-based path if needed.
@nick.thomas for the namespace collision, there is only "one way" to fix this and have subgroups: for any existing namespaces/paths with `.`, change it to `-` or `_` when making the URL slug, and use `.` only for the subgrouping.
What are we going to do with backups and the new storage format? Should we version backups somehow (I don't know if they already have a "backup version" manifest), and should we start generating backup tarballs in the new format (the one that uses only UUIDs)?
This is an exploratory question; we should probably open an issue after getting some input.
IMHO it also makes sense to keep backups in the new format so there is a bigger chance backups will be "durable" for longer. As of today, a backup of a namespace/repository from a year ago may not even be the same "project", as things can have been moved around; using UUIDs, you always have the backup for the right thing.
Yeah, it's a good question. We probably should consider exporting a CSV file with project names mapping to the pathnames and including that in the tarball.
Sounds like UUID[0..3] will be sufficient for a while.
@brodock could we split into 2 folders, like `UUID[0..1]/UUID[2..3]`?
This is because using the first 4 chars as a single level would end up with 65k folders at the top level.
This can end up being a performance hog at the FS level, and it makes it big enough to be a trap if we are trying to manually walk the path.
Besides that, ext3 only supports up to 32k folders inside a folder, so this would leave us in the same bad position as before (not that I'm advocating for using ext3, but you know, customers...)
Other than that, I agree with the proposed distribution.
@brodock asked me about how to deal with UUID data types in PostgreSQL/MySQL so I'll weigh in a bit.
Why are we proposing the use of UUIDs instead of just regular IDs? Using IDs we have the same benefits as UUIDs (= not using project names as directory names and such), but we don't have to deal with things such as generating the UUIDs and storing them in an efficient way that is supported for both MySQL and PostgreSQL. Further, renaming projects and such is supported out of the box since the IDs don't change.
Looking through the discussion there seems to be a lot of complexity involved in getting unique UUID fragments and what not, when IDs would solve all these problems as far as I can tell. Unless I missed something it feels like we're over-engineering things.
> Besides that, ext3 only supports up to 32k folders inside a folder, so this would leave us in the same bad position as before (not that I'm advocating for using ext3, but you know, customers...)
IIRC ext3 is (being) removed from the Linux kernel drivers list, so I'd argue we should just not support ext3 at all (or similar file systems with limits on the number of directories/files).
Using the integer ID approach we can still spread directories around, using the same technique as UUIDs. Since IDs are incremental this should lead to a fairly consistent distribution depending on the prefix length.
> Using the integer ID approach we can still spread directories around, using the same technique as UUIDs. Since IDs are incremental this should lead to a fairly consistent distribution depending on the prefix length.
I thought the UUID scheme was actually simpler for two reasons:
We don't have to ensure a monotonically increasing number (and avoid accidental collisions) by creating the project entry in the DB before we create the directory.
Sharding is a simple matter of dividing up the names, which LFS and git do already. There was a scheme proposed in https://gitlab.com/gitlab-org/gitlab-ce/issues/28112 that might work, but sharding seems a bit more complicated there.
With an integer, perhaps not use the "prefix", but rather some sort of modulo function (see https://en.wikipedia.org/wiki/Benford%27s_law for why about 30% of leading digits are "1").
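For illustration, a quick sketch of how a modulo-based fan-out differs from a prefix of an incremental ID (function names made up here):

```ruby
# A prefix of an incremental ID clusters: consecutive IDs 100000..109999
# all land in the "10" directory.
def prefix_dir(project_id)
  project_id.to_s.rjust(4, '0')[0..1]
end

# A modulo over the ID spreads projects evenly across a fixed set of buckets.
def modulo_dir(project_id, buckets = 256)
  format('%02x', project_id % buckets) # two hex chars in 00..ff
end
```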
> We don't have to ensure a monotonically increasing number (and avoid accidental collisions) by creating the project entry in the DB before we create the directory.
Perhaps I'm missing something, but what is the point of creating repository/file system data before creating a row in the database? I'm specifically referring to the re-use of the primary key ID, not something like our iid values. This means we don't need to perform any additional work, it's already there.
> Perhaps I'm missing something, but what is the point of creating repository/file system data before creating a row in the database?
Please correct me if I'm wrong, but if we look at this code:
```ruby
def save_project_and_import_data(import_data)
  Project.transaction do
    @project.create_or_update_import_data(data: import_data[:data], credentials: import_data[:credentials]) if import_data

    if @project.save && !@project.import?
      raise 'Failed to create repository' unless @project.create_repository
    end
  end
end
```
I can see a case where the create_repository occurs before the transaction is committed to the database and the transaction is rolled back. It looks like PostgreSQL auto increments the primary ID even if the transaction is not committed (https://www.postgresql.org/message-id/501B1494.9040502@ringerc.id.au), so perhaps this is not an issue with PostgreSQL. I think this is true for MySQL as well.
So I suppose it's not a real issue. But the sharding question still remains. Do we want a balanced directory tree? How would we do that with the primary ID approach?
IMHO, to balance with the primary ID we would have to use a hash function, which we agree not to do.
I've searched a little bit, and it looks like internally Postgres stores the UUID as "bytes" and converts back and forth to the representation format. We could emulate the same for MySQL, converting to the binary form when storing and back to the representation format when loading. This reduces the storage significantly compared with storing it as char/varchar.
So it can be stored as a `binary(16)` instead of a `varchar(36)`.
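To illustrate, the 36-character text form can be packed to and from 16 raw bytes in plain Ruby (a sketch of the idea, independent of any particular gem):

```ruby
require 'securerandom'

uuid = SecureRandom.uuid        # e.g. "8f4cd249-f0fb-40a0-9fc5-0397a884deab" (36 chars)

# Pack the 32 hex digits into 16 raw bytes, suitable for a binary(16) column.
packed = [uuid.delete('-')].pack('H*')
packed.bytesize                 # => 16

# Unpack back to the canonical dashed form when loading from the database.
hex = packed.unpack('H*').first
restored = [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join('-')
restored == uuid                # => true
```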
I've gone down a rabbit hole trying to make https://github.com/mathieujobin/activerecord-mysql-uuid-column work for Rails 4.x. In the end it seems that there is another one that may work in a similar way, but for Rails 4.x. Rails 5 has support for UUID data types for MySQL, so we don't need any gem for that anymore.
UUIDs are expensive to store, maintain, and query in a database.
We can use the primary ID and compute a hash (no salting) of the value for the same effect.
To avoid potential collisions with the hash function, use a scheme such as `<hash (or a subset of it)>/<project_id>/<data>`. For example: `hash{0..1}/hash{2..3}/project_id/<data>` (see the sketch after the lists below).
Do we want to include the .git extension, or is that unnecessary?
Advantages:
We already store the ID, so it's not another column to add
It retains the same properties as a UUID approach.
We don't have to load extra bytes for every SELECT * FROM projects
Disadvantages:
The directory mapping is computed in a special way. We can't simply export a CSV of the projects table and see what paths are used.
The paths are guessable if you know the project ID.
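For reference, here is a sketch of the scheme from the proposal above (the hash choice and function name are illustrative, not a decided implementation):

```ruby
require 'digest'

# Fan projects out by a hash of the primary ID (SHA-256 here as one possible
# choice, no salting), keeping the ID itself in the path; the repository data
# would then live underneath this directory.
def disk_path(project_id)
  digest = Digest::SHA2.hexdigest(project_id.to_s)
  File.join(digest[0..1], digest[2..3], project_id.to_s)
end

disk_path(42) # => "<h0h1>/<h2h3>/42", where h0..h3 are the first hex digits of the hash
```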