Support for Shallow Forks
Proposal
Background
Git has been around for more than 10 years. Companies, teams, and OSS projects that have used Git for a long time are hitting the upper bounds of the recommended Git storage limits. Other projects want to adopt a monorepo model to simplify their CI/CD process, which also pushes the limits of a Git repository. We are seeing projects of 10G, 20G, and even 500G in size. Forking these repositories for the sole purpose of proposing a change causes undue I/O and CPU load on the machines performing the fork operation. With capabilities like Gitaly clusters, a fork chain of this size could easily reach many TB that must be replicated n times, causing unusually high network traffic and making it harder for global replication to reach eventual consistency.
Example
Take a quick look at a project like Chromium. The repository is ~19G in size. If 100 people who are not core committers work on this project, we would need 1,900G (roughly 2TB) of storage for the user forks. We know that users don't delete their forks once their commit(s) have been approved, leaving orphaned forks that continue to be backed up and replicated. These forks also become stale over time, so the easy path is for users to re-fork without understanding the consequences.
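The arithmetic above can be sketched as a small shell helper. The function name `fork_storage_gb` and the optional replica count are assumptions made for illustration; they are not part of GitLab or Gitaly.

```shell
# Estimate total fork storage in GB.
# Usage: fork_storage_gb <repo_size_gb> <number_of_forks> [replicas]
# "replicas" is a hypothetical replication factor (the "n times" from the
# Background section); it defaults to 1.
fork_storage_gb() {
    repo_gb=$1
    forks=$2
    replicas=${3:-1}
    echo $(( repo_gb * forks * replicas ))
}

# Chromium example from the text: ~19G repository, 100 full forks.
fork_storage_gb 19 100      # prints 1900

# Same forks replicated 3 times (an assumed replica count).
fork_storage_gb 19 100 3    # prints 5700
```

The estimate treats every fork as a complete, independent copy and ignores any object sharing between forks, so it is a rough upper bound.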
Suggestion
When a user clicks the fork button, provide a modal dialog box that asks the user to choose either a Full Fork or a Shallow Fork. The Full Fork would work the way it does today, whereas the Shallow Fork is essentially a `git clone --filter=blob:none --no-checkout ...`. This would reduce the overall size of the clone above from ~19G to ~1.9G. The repository would show no files, so there should be a page, similar to the one shown when you create a new empty project, explaining how to use commands like `sparse-checkout` to limit the amount of data that needs to be transferred from the GitLab instance to the user's workstation. A normal branching workflow would then take place, allowing the user to open an MR back to the parent repository. There are a number of benefits here, such as leveraging CI for smaller commits and allowing clones to use options like `filter=oid:<branch>` to perform quick clones and get faster feedback cycles.
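As a rough sketch of what a user might run against a Shallow Fork, the helper below combines a blobless clone with a cone-mode sparse checkout. The function name `shallow_fork_clone`, the example URL, and the example paths are hypothetical; only the `git clone` and `git sparse-checkout` options themselves are standard Git (sparse-checkout requires Git 2.25 or later).

```shell
# Sketch: blobless clone plus sparse checkout of selected directories.
# Usage: shallow_fork_clone <url> <dir> <branch> <path>...
shallow_fork_clone() {
    if [ "$#" -lt 4 ]; then
        echo "usage: shallow_fork_clone <url> <dir> <branch> <path>..." >&2
        return 2
    fi
    url=$1
    dir=$2
    branch=$3
    shift 3

    # Fetch commits and trees only; blobs are downloaded on demand.
    git clone --filter=blob:none --no-checkout "$url" "$dir" || return 1

    (
        cd "$dir" || exit 1
        # Restrict the working tree to the requested directories (cone mode).
        git sparse-checkout init --cone &&
        git sparse-checkout set "$@" &&
        # Checking out now fetches only the blobs the sparse working tree needs.
        git checkout "$branch"
    )
}

# Example (hypothetical URL, branch, and paths):
#   shallow_fork_clone https://gitlab.example.com/me/chromium.git chromium main docs tools
```

The server must allow partial clone filters for the `--filter` option to take effect; otherwise Git falls back to a full clone.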
Other options
- Allow the user to choose a branch for the fork, then use a `filter=combine` to group the clone and limit the blobs/oids that are replicated.
- Allow a user to do this manually on their own machine, giving them full control of the options they want to use, and then let them add a remote to the project so they can manually attach a fork.
- Allow a user to provide a new branch and do a sparse checkout on the backend for them, showing the files, so they can then use tools like the Web IDE for changes.
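The first option above might look roughly like the following on the command line. This is only a sketch: the function name `branch_fork_clone` is hypothetical, and the default filter `combine:blob:none+tree:3` is an arbitrary illustration of Git's `combine:<filter1>+<filter2>` syntax, not a recommended setting.

```shell
# Sketch: clone a single branch with a combined object filter.
# Usage: branch_fork_clone <url> <dir> <branch> [filter-spec]
branch_fork_clone() {
    if [ "$#" -lt 3 ]; then
        echo "usage: branch_fork_clone <url> <dir> <branch> [filter-spec]" >&2
        return 2
    fi
    url=$1
    dir=$2
    branch=$3
    # Assumed default: omit all blobs, and omit trees at depth 3 or deeper.
    filter=${4:-combine:blob:none+tree:3}

    # --single-branch limits the fetch to the history of the chosen branch;
    # --no-checkout avoids lazily fetching every omitted object right away.
    git clone --single-branch --branch "$branch" \
        --filter="$filter" --no-checkout "$url" "$dir"
}

# Example (hypothetical URL and branch):
#   branch_fork_clone https://gitlab.example.com/group/project.git project my-feature
```

Whether a given filter is accepted depends on the server's configuration, so the filter spec is left as a parameter.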