Reduce Git reference iteration workload on file-cny-01
The main development repo of GitLab itself, gitlab-org/gitlab, is hosted on a dedicated Gitaly server called `file-cny-01`. We have long had vertical scaling challenges with this Gitaly server. While investigating production incidents in December 2020 (e.g. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3161), we discovered that the CPU workload on `file-cny-01` is dominated by our own CI fetch traffic (68% in [this example](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/746#note_482627326)). Looking closer at that workload, we noticed that a surprising amount of time (50% of the total non-idle workload) was being spent iterating Git references.
This epic gives an overview of our efforts to address this inefficiency.
Addressing it matters for two reasons:
1. Vertical scaling challenges on `file-cny-01` hinder our own developers.
1. Being able to host gitlab-org/gitlab only by placing it on a dedicated Gitaly server is problematic. From an infrastructure viewpoint, no single repository should need special treatment, and we want to be able to host repositories like this one without it.
## Status 2021-03-16
By looking at data collected during production incidents, we discovered two performance problems in Git on repositories with many refs (gitlab-org/gitlab has 500K refs, most of them hidden from users). We first identified and documented a workaround: configuring CI to fetch with `--no-tags`. Looking deeper, we then found ways to improve the performance of Git itself, which we submitted to the Git mailing list. These improvements were included in Git 2.31.0.
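This epic does not describe the Git patches themselves. As a generic illustration of why reference iteration cost grows with the total number of refs, the following is a minimal Python sketch (not Git's implementation) contrasting a scan that visits every ref with a walk restricted to a ref prefix. It assumes the refs are available as a sorted list; the helper names (`build_refs`, `full_scan`, `prefix_scan`) and the ref names and counts are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def build_refs() -> List[str]:
    """Hypothetical ref list: many hidden-style refs, few branches and tags."""
    refs = [f"refs/merge-requests/{i}/head" for i in range(1000)]
    refs += [f"refs/heads/branch-{i}" for i in range(10)]
    refs += [f"refs/tags/v{i}" for i in range(5)]
    return sorted(refs)


def full_scan(refs: List[str], prefix: str) -> Tuple[List[str], int]:
    """Visit every ref and keep those under `prefix`.

    Returns (matching refs, number of refs visited). The cost is always
    proportional to the total number of refs, however few match.
    """
    matches = [ref for ref in refs if ref.startswith(prefix)]
    return matches, len(refs)


def prefix_scan(sorted_refs: List[str], prefix: str) -> Tuple[List[str], int]:
    """Binary-search to the first ref >= prefix, then walk forward while refs match.

    In a sorted list, all refs starting with `prefix` are contiguous and begin
    at bisect_left(prefix), so the walk can stop at the first non-match.
    Returns (matching refs, number of refs examined during the walk).
    """
    i = bisect_left(sorted_refs, prefix)
    matches: List[str] = []
    examined = 0
    while i < len(sorted_refs):
        examined += 1
        if not sorted_refs[i].startswith(prefix):
            break  # past the contiguous block of matching refs
        matches.append(sorted_refs[i])
        i += 1
    return matches, examined


if __name__ == "__main__":
    refs = build_refs()
    tags, visited = full_scan(refs, "refs/tags/")
    print(f"full scan: {len(tags)} matches, {visited} refs visited")
    tags, examined = prefix_scan(refs, "refs/tags/")
    print(f"prefix scan: {len(tags)} matches, {examined} refs examined")
```

Both functions return the same refs; the difference is that the full scan's cost tracks the total ref count (dominated here by the hidden-style refs), while the prefix walk only touches the refs it actually returns plus at most one.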
We modified Omnibus and CNG so that they could ship these performance patches ahead of the Git 2.31.0 release. We wrapped up the project by removing the custom `--no-tags` CI setting from gitlab-org/gitlab and gitlab-com/www-gitlab-com, because it is better to avoid custom settings and rely on Git itself.
The unnecessary server-side work eliminated by these changes amounted to about 50% of the Git CPU time on `file-cny-01` at the time we found the problem.