Improve the accuracy of repository size calculation

Problem

We currently use variants of the git rev-list --all --objects --disk-usage command to compute the repository size that we report to users in the UI. This approach is flexible: it accurately reports the size of reachable objects and can easily exclude parts that the customer cannot control (such as internal refs). However, it comes at a significant performance cost. (Full discussion here) As a result, we had to revert this change until we could invest non-trivial resources.
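As a rough illustration of the flexibility described above, the sketch below builds a git rev-list invocation that skips some internal ref namespaces before summing disk usage. The helper names and the list of excluded globs are hypothetical (the globs are borrowed from the tradeoffs listed in this issue), and this is not necessarily how the production code assembles the command. It relies on the documented behaviour that --exclude only affects the next --all, so the exclusions must come first.

```python
import subprocess
from typing import List, Sequence

# Hypothetical set of internal ref namespaces to leave out of the calculation.
INTERNAL_REF_GLOBS = ("refs/merge-requests/*", "refs/keep-around/*")


def rev_list_disk_usage_cmd(
    repo_path: str, excluded_globs: Sequence[str] = INTERNAL_REF_GLOBS
) -> List[str]:
    """Build the argv for a reachable-object disk usage query."""
    cmd = ["git", "-C", repo_path, "rev-list"]
    # --exclude applies to the next --all, so it has to be placed before it.
    cmd += [f"--exclude={glob}" for glob in excluded_globs]
    cmd += ["--all", "--objects", "--disk-usage"]
    return cmd


def reachable_disk_usage(repo_path: str) -> int:
    """Run the command; with --disk-usage git prints one byte count."""
    result = subprocess.run(
        rev_list_disk_usage_cmd(repo_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())
```

The object walk that makes this approach accurate is also what makes it slow on large repositories, which is the cost discussed above.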

Proposal

Rather than delay further, it makes more sense to adopt an alternative computation that may be slightly less accurate but performs well immediately. This does not preclude us from iterating back toward the git rev-list approach in the future.

After testing, we believe that the best option is to use git cat-file in batch mode to count all objects. While this is less flexible, it is much more performant.
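A minimal sketch of what such a batch-mode calculation could look like follows. The exact flags and format string used in the real implementation are not stated in this issue, so the combination of --batch-all-objects, --unordered and the %(objectsize:disk) format atom is an assumption, as are the function names. Note that --batch-all-objects lists every object in the object store, not only reachable ones, which matches the first tradeoff listed below.

```python
import subprocess
from typing import Iterable


def sum_disk_sizes(lines: Iterable[str]) -> int:
    """Sum per-object on-disk sizes, given one decimal integer per line."""
    total = 0
    for line in lines:
        line = line.strip()
        if line:  # ignore blank lines
            total += int(line)
    return total


def all_objects_disk_usage(repo_path: str) -> int:
    """Sum the on-disk size of every object git cat-file reports."""
    cmd = [
        "git", "-C", repo_path, "cat-file",
        "--batch-all-objects",                # every object, reachable or not
        "--unordered",                        # skip sorting by object ID
        "--batch-check=%(objectsize:disk)",   # print only the on-disk size
    ]
    # Stream the output so large repositories are not buffered in memory.
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        total = sum_disk_sizes(proc.stdout)
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return total
```

Because no reachability walk is involved, the cost is roughly a single pass over the object store, at the price of not being able to filter by ref.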

Release Notes

We have worked hard to make the displayed repository size more accurate by improving how we calculate it. This was done to ensure as much transparency as possible. Previously, the calculation included shared objects stored in the pool repository. This update fixes that error.

While this update is an improvement, it is not perfect. There are still situations where the calculated repository size differs from the actual repository size, and we are actively working to fix them.

https://docs.gitlab.com/ee/user/project/repository/#repository-size

Known tradeoffs

This proposed option has the following downsides:

- It always includes all objects, regardless of whether they are reachable. This could result in a higher size being reported (up to 25%), though for most repositories the result will be very close.
- It counts refs that are difficult for the user to control, such as refs/merge-requests and refs/keep-around, though these should be near zero if the commits are never rebased.
- ~~If a project is a member of a pool repository, its repository size will include the shared objects that live in the pool repository.~~

Edited Aug 15, 2022 by Mark Wood