🐘 Git for enormous repositories with partial clone and sparse checkout
<!-- triage-serverless v3 PLEASE DO NOT REMOVE THIS SECTION -->
*This page may contain information related to upcoming products, features and functionality.
It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes.
Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.*
<!-- triage-serverless v3 PLEASE DO NOT REMOVE THIS SECTION -->
## Vision
Enormous repositories that have traditionally only been possible to store in centralized version control systems like CVS, SVN, or Perforce should work well in Git and GitLab. Whether a repository is large because of binary assets or because it is a huge project with a long history, it should work reliably in GitLab.
This requires improvements to Git and to GitLab, specifically improving support for partial clone.
## Problem to solve
As Git repositories become larger, they become more challenging to use. GitLab will make it easy to work with large repositories, so that new projects can migrate to Git and existing repositories become more enjoyable to use.
GitLab will do this by adding support for partial clone, so that clones can be:
- filtered by path, to help teams with large monorepos
- filtered by blob size, to help teams that work with large files like binary assets
| **Mockup:** filter clone by path | **Mockup:** filter clone by blob size |
| --- | --- |
|  |  |
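The two filter styles above map directly onto Git's `--filter` option. A minimal local sketch (repository and file names are illustrative; the bare repository stands in for a GitLab remote, which must opt in to filtered fetches):

```shell
# Build a throwaway repository containing a small text file and a 2 MiB
# stand-in for a binary asset.
git init -q seed
echo 'README' > seed/README.md
head -c 2097152 /dev/zero > seed/assets.bin
git -C seed add .
git -C seed -c user.name=demo -c user.email=demo@example.com commit -qm "seed"

# A bare copy stands in for the GitLab server; the server side must
# allow filtered fetches.
git clone -q --bare seed origin.git
git -C origin.git config uploadpack.allowFilter true

# Filter by blob size: blobs over 1 MiB stay on the server until needed.
git clone -q --no-checkout --filter=blob:limit=1m "file://$PWD/origin.git" by-size

# Filter by path: omit every blob up front, then use sparse checkout to
# materialize only the directories you work on.
git clone -q --no-checkout --filter=blob:none "file://$PWD/origin.git" by-path
```

In both clones, omitted blobs are fetched on demand the first time a command actually needs their contents.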
## :rocket: Status
- :white_check_mark: Partial Clone enabled by default from [GitLab 13.0](https://about.gitlab.com/releases/2020/05/22/gitlab-13-0-released/#exclude-large-files-using-partial-clone)
- :art: Binary file locking workflows supported via the Git LFS client (files do not need to be stored in Git LFS) (improved docs in progress https://gitlab.com/gitlab-org/gitlab/-/merge_requests/39133)
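As a sketch, the locking workflow looks like the following. The file path is illustrative, and the commands require the Git LFS client plus a remote (such as GitLab) that implements the LFS locking API; the locked file itself does not need to be stored in LFS.

```shell
# Mark an asset type as lockable so Git keeps the files read-only until
# a lock is taken (writes a rule to .gitattributes).
git lfs track "*.psd" --lockable

# Take an exclusive server-side lock before editing, release it when done.
git lfs lock images/banner.psd
git lfs unlock images/banner.psd
```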
## Further details
Git repositories can become large in a number of ways, but the most pressing problems are:
- repositories that are large because they contain an extremely large number of files and a long history
- repositories that are large because they contain large binary assets, like images
Both problems can be mitigated by reducing the amount of data Git downloads to the subset the user actually needs.
Additionally, other factors that influence Git performance are:
<details><summary>Too many pushes</summary>
When pushing commits to the Git server, new objects are uploaded, and then the branch ref is updated to point to the latest commit. Large numbers of users trying to write to the same branch simultaneously create contention, with developers racing each other to pull the latest changes before pushing their own.
- **simultaneous pushes:** Merge requests provide a way for changes to be made and merged in a more organized fashion, rather than pushing directly to a common shared branch.
- :white\_check\_mark: [solved by Merge request workflow](https://docs.gitlab.com/ee/user/project/merge_requests/index.html)
- **merge contention:** In large, high-velocity projects merge requests are not a complete solution, and contention returns as more and more people try to merge in a short period of time. This is exacerbated by CI pipelines that may take tens of minutes to run and that should be re-run every time the target branch changes.
- :warning: Merge queue https://gitlab.com/gitlab-org/gitlab-ee/issues/9186
</details>
<details><summary>Too many branches (refs)</summary>
Refs are an important component of Git internals (see https://git-scm.com/book/en/v1/Git-Internals-Git-References) and act as pointers to different commits. Branches and tags are the most well known refs, but refs are used for other purposes too. GitLab creates refs for merge requests and commits with diff discussions so that they won't be garbage collected and continue to be accessible.
- **ref advertisement:** The most significant problem caused by too many refs is slow `git-fetch` and `git-push` due to ref advertisement which is part of the Git protocol. The new Git protocol v2 addresses this.
- :white\_check\_mark: Git protocol v2: [GitLab 11.4](https://about.gitlab.com/2018/10/22/gitlab-11-4-released/#git-protocol-v2) | [Spec](https://github.com/git/git/blob/master/Documentation/technical/protocol-v2.txt) | [Google Blog](https://opensource.googleblog.com/2018/05/introducing-git-protocol-version-2.html)
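A client can opt in to protocol v2 explicitly (it is the default from Git 2.26 onward); with v2 the client requests only the refs it needs instead of receiving the full ref advertisement up front:

```shell
# Enable wire protocol v2 for all repositories on this machine.
git config --global protocol.version 2

# Or force it for a single command without changing configuration, e.g.:
#   git -c protocol.version=2 ls-remote origin
```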
</details>
<details><summary>Too much history (commits or large objects)</summary>
Because Git is a DVCS, cloning a repository downloads not only the latest versions of files but also the full history, so that you can work offline and understand how source files change over time. A consequence is that all of this data must be downloaded when cloning a repository, and all new changes must be downloaded when fetching updates.
- **binary files:** Adding binaries significantly increases repository size. A binary asset added to any branch impacts all users, and new versions of binary assets cannot be stored as efficiently as text.
- :white\_check\_mark: Use Git LFS and reduce repo size with BFG: [GitLab Docs](https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html#using-the-bfg-repo-cleaner) | [BFG Repo Cleaner](https://rtyley.github.io/bfg-repo-cleaner/)
- :bulb: Native large file support https://gitlab.com/groups/gitlab-org/-/epics/958 and let Git take care of it
- **GitLab housekeeping operations:** Tasks like checking whether an LFS object is still needed require scanning every single commit in the repository. As the repository grows, these operations become very expensive.
- :white\_check\_mark: `git-commit-graph`: [Git Docs](https://www.git-scm.com/docs/git-commit-graph) | [Git Rev 45](https://git.github.io/rev_news/2018/11/21/edition-45/) | [MSFT blog](https://blogs.msdn.microsoft.com/devops/2018/06/25/supercharging-the-git-commit-graph/)
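A minimal sketch of generating a commit-graph file, which caches commit metadata so history walks no longer need to parse every commit object (the repository here is a throwaway created only for the demonstration; on GitLab the file is written during housekeeping):

```shell
# Throwaway repository with a single commit.
git init -q cg-demo
git -C cg-demo -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"

# Write the commit-graph for all reachable commits; subsequent commands
# such as `git log --graph` read it automatically.
git -C cg-demo commit-graph write --reachable
```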
</details>
<details><summary>Too many files (e.g. monolithic repository)</summary>
Git uses an index to track the project files that should be committed. For large projects the index may contain a very large number of files – the Windows repository has ~3M, the Linux kernel has ~65K. Many Git operations must first read this very large index. Important operations must then also compare each file in the index with the corresponding Git object to determine whether the file has changed – a very IO-intensive operation.
- **slow `git-clone`:** A large number of files, combined with a long history, results in a large amount of data needing to be transferred. The impact of transferring that data varies with the situation – a developer close to the server, a developer on a lower-quality connection, or a CI system that must clone a specific state before tests can run.
- :bulb: Partial Clone: [Spec](https://github.com/git/git/blob/master/Documentation/technical/partial-clone.txt)
- :x: VFS for Git: [Official Site](https://github.com/Microsoft/VFSForGit)
- **slow `git-checkout`:** Changing branches can require changing a very large number of files, which is an IO-heavy operation. This can make checkout very slow.
- :white\_check\_mark: Sparse checkout https://git-scm.com/docs/git-read-tree#_sparse_checkout
- **slow `git-status`:** A frequently run command in which each file must be compared with its Git object. This is IO intensive, requiring every file to be scanned.
- :white\_check\_mark: `git-commit-graph`: [Git Docs](https://www.git-scm.com/docs/git-commit-graph) | [Git Rev 45](https://git.github.io/rev_news/2018/11/21/edition-45/) | [MSFT blog](https://blogs.msdn.microsoft.com/devops/2018/06/25/supercharging-the-git-commit-graph/)
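A minimal local sketch of the sparse checkout solution above, on a hypothetical monorepo layout, keeping only one service's directory in the working tree while history remains complete:

```shell
# Throwaway monorepo with two top-level services (layout is illustrative).
git init -q mono
mkdir -p mono/services/api mono/services/web
echo 'package api' > mono/services/api/api.go
echo '<html></html>' > mono/services/web/index.html
git -C mono add .
git -C mono -c user.name=demo -c user.email=demo@example.com \
    commit -qm "initial layout"

# Restrict the working tree to services/api; files outside the selected
# paths are removed from disk but stay in history and in the index.
git -C mono sparse-checkout set services/api
```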
</details>
## Up next/In progress
- :package: Large file storage https://gitlab.com/groups/gitlab-org/-/epics/1487
- HTTP driver to allow large binary files to be offloaded to dumb object storage
## Customers
- https://gitlab.my.salesforce.com/0016100000W3BBg ~customer
- https://gitlab.my.salesforce.com/00161000004yLEy ~"customer+" trying to migrate remaining data off P4
- https://gitlab.my.salesforce.com/00161000004bZPD ~"customer+" trying to migrate remaining data off P4
- https://gitlab.my.salesforce.com/0016100000NmU19 ~customer trying to migrate remaining data off multi-terabyte P4 depot
- https://gitlab.my.salesforce.com/00161000004zrG3 ~customer the customer has a large CVS repository where the initial clone is slow (>30 mins)
- https://gitlab.my.salesforce.com/0016100000fDO7w ~customer with many large repos
- https://gitlab.my.salesforce.com/00161000006g08Q ~customer ~"needs investigation"
- https://gitlab.my.salesforce.com/0016100000AYw26 ~customer trying to migrate remaining data from various tools to Git
- https://gitlab.my.salesforce.com/00161000003RH62 prospect, wants to migrate to Git
- https://gitlab.my.salesforce.com/00161000004zrCF ~customer interested in near-future state
- https://gitlab.my.salesforce.com/0016100000sPtsL ~customer
- https://gitlab.my.salesforce.com/0016100001FS8wu ~customer many large repositories
## References / links
- https://docs.microsoft.com/en-us/azure/devops/learn/git/git-at-scale