Store merge request diffs in object storage as an alternative to PG
Currently, merge_request_diff_commits and merge_request_diff_files are two of the largest tables on GitLab.com, weighing in at 288 GB and 740 GB respectively.
This leads to numerous problems; see:
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4939
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4853
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4916
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4917
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4920
In a call between @Finotto, @smcgivern, and myself, we discussed some of the existing proposals, such as https://gitlab.com/gitlab-org/gitlab-ce/issues/37632. While these are good solutions, some of them are difficult to migrate towards, particularly on GitLab.com.
During the call, we discussed an alternative proposal: keep the existing structure for now (i.e. full diffs, not deduplicated through blobs), but conditionally migrate the diffs to object storage, possibly using a git-lfs-like scheme in which the diff in the table is replaced with a pointer to an object storage location.
This approach would allow older diffs to be progressively moved over to object storage, while diffs for newer merge requests continue to be stored in the database for performance reasons.
It would also make migrating GitLab.com's data much easier, possibly via a long-running background migration.
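To make the pointer idea concrete, here is a rough sketch of how it might work. It is only an illustration under stated assumptions, not a proposed implementation: the `external_key` column, the key format, the `ObjectStore` class, and the age cutoff are all hypothetical, and an in-memory dict stands in for both the table and the bucket.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class DiffFileRow:
    """Hypothetical stand-in for one merge_request_diff_files row."""
    id: int
    created_at: datetime
    diff: Optional[str] = None          # inline diff text, as stored today
    external_key: Optional[str] = None  # assumed new column: pointer into object storage


class ObjectStore:
    """In-memory stand-in for an object storage bucket (assumption)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    def put(self, key: str, data: str) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> str:
        return self._blobs[key]


def read_diff(row: DiffFileRow, store: ObjectStore) -> str:
    """Return the diff, following the pointer if the row has been migrated."""
    if row.external_key is not None:
        return store.get(row.external_key)
    return row.diff or ""


def migrate_batch(rows: List[DiffFileRow], store: ObjectStore,
                  cutoff: datetime, batch_size: int = 1000) -> int:
    """Move up to batch_size diffs older than cutoff into object storage.

    Rows that are newer than the cutoff, or already migrated, are skipped,
    so the function can be re-run safely by a long-running background job.
    """
    migrated = 0
    for row in rows:
        if migrated >= batch_size:
            break
        if row.external_key is not None or row.created_at >= cutoff:
            continue
        key = f"merge_request_diff_files/{row.id}"  # hypothetical key format
        store.put(key, row.diff or "")
        # Upload first, then swap the inline diff for the pointer.
        row.external_key = key
        row.diff = None
        migrated += 1
    return migrated
```

Readers go through `read_diff`, so callers would not need to know where a given diff lives. The migration is idempotent, which is what would let it run in small batches over a long period.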
This approach does not preclude future enhancements (such as deduplication), but it is a smaller first step that would relieve some of the pain these tables are currently causing.