Create forks of public projects with deduplicated Git objects
GitLab copies the whole repository when forking, which stores many redundant objects on the server and makes forking slow. Deduplicating Git objects across forks will reduce the cost of running a GitLab instance with large fork networks (particularly once HA Gitaly support is added, which will at least double storage requirements). It will also significantly improve the performance of forking, a key workflow in open source projects that is often used in private organizations too.
### Vision
Forking is the workflow of choice for open source projects, and is used by many private organizations too, because its flexibility and simplicity make it easy for anyone to contribute without granting everyone write permissions to specific branches. Forking in GitLab should be first class, feel as fast as creating a branch, and be efficient for instance administrators, so that GitLab can be the best place to host an open source project and private organizations can adopt forking workflows without penalty.
### Background
GitHub developed [delta-islands](https://github.com/git/git/commit/f3504ea3dd21b0a6d38bcd369efa0663cdc05416) as part of their efforts to make forking faster and deduplicate objects across forks, which they describe in [Counting Objects](https://githubengineering.com/counting-objects/). @chriscool worked with Peff at GitHub to upstream this change for use with object deduplication.
@jacobvosmaer\-gitlab investigated using `alternates` or `namespaces`. Based on this investigation, `alternates` was determined to be the preferred approach.
### Proposal
> **ATTENTION: Hashed storage is required!** – Object deduplication requires the parent project to be migrated to hashed storage (https://docs.gitlab.com/ee/administration/repository_storage_types.html#how-to-migrate-to-hashed-storage), and hashed storage must be enabled so that the fork is also created in hashed storage. This ensures that repository paths do not change, which would break the references between the repositories and the object pool.
This proposal is based on the investigation and research conducted in https://gitlab.com/gitlab-org/gitaly/issues/1331.
When creating new forks, the parent repository can be added to the fork's `objects/info/alternates` file so that objects from the parent are re-used instead of copied. Most Git operations work unchanged with alternates, which limits the number of Gitaly RPCs that must be created or changed.
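As a rough illustration of the mechanism, the sketch below writes an entry into a fork's `objects/info/alternates` file, which Git reads as one object directory path per line. The helper name `add_alternate` and the choice to store an absolute path are assumptions made for this example; this is not Gitaly's implementation.

```python
import os


def add_alternate(fork_git_dir: str, source_objects_dir: str) -> str:
    """Register source_objects_dir as an alternate object store for a fork.

    Hypothetical helper for illustration only. fork_git_dir is the fork's
    bare repository directory; source_objects_dir is the `objects`
    directory that the fork should borrow objects from.
    Returns the path of the alternates file.
    """
    info_dir = os.path.join(fork_git_dir, "objects", "info")
    os.makedirs(info_dir, exist_ok=True)
    alternates_path = os.path.join(info_dir, "alternates")

    # Assumption: store an absolute path. Git also accepts paths relative
    # to the fork's own objects directory.
    entry = os.path.abspath(source_objects_dir)

    # Read existing entries so that repeated calls do not add duplicates.
    entries = []
    if os.path.exists(alternates_path):
        with open(alternates_path, encoding="utf-8") as f:
            entries = [line.strip() for line in f if line.strip()]

    if entry not in entries:
        entries.append(entry)

    # The file format is one object directory path per line.
    with open(alternates_path, "w", encoding="utf-8") as f:
        f.write("\n".join(entries) + "\n")

    return alternates_path
```

Once the entry exists, Git looks up missing objects in the listed directory, so the fork only needs to store objects that the parent does not have. The entry is a plain path, which is why the referenced repository must not move or be deleted while forks depend on it.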
The first iteration will benefit new forks, but will not make changes to existing forks. As a fork diverges from the parent with more commits, duplication will increase.
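The hashed storage requirement noted at the top of this proposal can be illustrated with a small sketch. It assumes the layout described in the linked hashed storage documentation, where the repository path is derived from the SHA-256 hash of the project ID, and contrasts it with a legacy path built from the namespace and project name. Both function names are hypothetical.

```python
import hashlib


def hashed_repo_path(project_id: int) -> str:
    """Disk path under hashed storage (assumed layout, illustration only)."""
    digest = hashlib.sha256(str(project_id).encode("utf-8")).hexdigest()
    # The first two and next two hex characters form the directory levels.
    return f"@hashed/{digest[0:2]}/{digest[2:4]}/{digest}.git"


def legacy_repo_path(namespace: str, project_name: str) -> str:
    """Disk path under legacy storage, which follows the project's URL."""
    return f"{namespace}/{project_name}.git"


if __name__ == "__main__":
    # Renaming a project changes its legacy path but not its hashed path,
    # because the project ID stays the same.
    print(legacy_repo_path("group", "old-name"))
    print(legacy_repo_path("group", "new-name"))
    print(hashed_repo_path(1))
```

Because the hashed path depends only on the project ID, renaming or transferring a project leaves the path on disk unchanged, so entries in `objects/info/alternates` that point at it remain valid.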
### Weekly demos
| Date | Assignee | Recording | Issues |
|---|---|---|---|
| 2019-04-12 | @jacobvosmaer\-gitlab | TODO | https://gitlab.com/gitlab-org/gitaly/issues/1625 https://gitlab.com/gitlab-org/gitaly/issues/1624 |
| 2019-04-XX | @zj | | |
| 2019-04-XX | | | |
| 2019-04-XX | | | |
### Links / references
- [GitHub Engineering: Counting Objects](https://githubengineering.com/counting-objects/)
- [Delta islands (Git 2.20.0)](https://github.com/git/git/commit/f3504ea3dd21b0a6d38bcd369efa0663cdc05416)
- [`objects/info/alternates`](https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates)
- [Junio: Bringing a bit more sanity to "alternates"?](https://git-blame.blogspot.com/2012/08/bringing-bit-more-sanity-to-alternates.html)