Improve the accuracy of repository size calculation
Problem
We currently use variants of the git rev-list --all --objects --disk-usage command to compute the repository size that we report to users through our user interface. While this gives us the flexibility to accurately report the size of reachable objects and to easily exclude parts that the customer cannot control (such as internal refs), it comes at a significant performance cost. (Full discussion here) Because of that cost, we had to revert this change until we could invest non-trivial resources.
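For illustration, the rev-list approach described above can be sketched as follows. This is a minimal sketch, not the production implementation: the function names and the repo_path parameter are hypothetical, and the excluded ref globs are only examples of internal refs. It assumes a git version that supports --disk-usage.

```python
import subprocess
from typing import List, Sequence


def rev_list_disk_usage_args(excluded_ref_globs: Sequence[str]) -> List[str]:
    """Build the git arguments for a reachable-object size calculation."""
    args = ["rev-list"]
    # --exclude only affects the next --all/--branches/--tags/--glob option,
    # so every exclusion has to be placed before --all.
    args += [f"--exclude={glob}" for glob in excluded_ref_globs]
    args += ["--all", "--objects", "--disk-usage"]
    return args


def reachable_size_bytes(repo_path: str, excluded_ref_globs: Sequence[str] = ()) -> int:
    """Return the on-disk size in bytes of objects reachable from the non-excluded refs."""
    proc = subprocess.run(
        ["git", "-C", repo_path, *rev_list_disk_usage_args(excluded_ref_globs)],
        check=True,
        capture_output=True,
        text=True,
    )
    # With --disk-usage, git prints a single number (bytes) instead of the object list.
    return int(proc.stdout.strip())
```

A call such as reachable_size_bytes("/path/to/repo.git", ["refs/keep-around/*", "refs/merge-requests/*"]) would leave those internal refs out of the total. The walk over all reachable objects is what makes this approach expensive.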
Proposal
Instead of delaying further, it makes more sense to adopt an alternative computation that may be slightly less accurate but is performant immediately. This does not preclude us from iterating back toward the git rev-list approach in the future.
After testing, we believe that the best option is to use git cat-file in batch mode to count all objects. While this is less flexible, it is much more performant.
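One way to picture the git cat-file approach is the sketch below. It is an assumption about how such a calculation could look, not the implementation that was shipped: the function names and repo_path are hypothetical. It uses --batch-all-objects together with a --batch-check format of %(objectsize:disk), so git prints one on-disk size per object and the sizes are summed.

```python
import subprocess
from typing import Iterable


def sum_disk_sizes(lines: Iterable[str]) -> int:
    """Sum one-number-per-line output, ignoring blank lines."""
    total = 0
    for line in lines:
        line = line.strip()
        if line:
            total += int(line)
    return total


def all_objects_size_bytes(repo_path: str) -> int:
    """Return the on-disk size in bytes of every object git can see."""
    proc = subprocess.run(
        [
            "git", "-C", repo_path, "cat-file",
            # Enumerate every object, reachable or not, including objects
            # borrowed from alternate object stores.
            "--batch-all-objects",
            # Print only the on-disk size of each object, one per line.
            "--batch-check=%(objectsize:disk)",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return sum_disk_sizes(proc.stdout.splitlines())
```

Because no ref or reachability walk is involved, this avoids the cost of the rev-list traversal, at the price of being unable to leave out unreachable objects or specific refs.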
Release Notes
We have worked hard to make the displayed repository size more accurate by improving how we calculate it. This was done to ensure as much transparency as possible. In the past, this calculation included shared objects stored in the pool repository. This update fixes that calculation error.
While this update is an improvement, it is not perfect. There are still situations where the calculated repository size differs from the actual repository size, and we are actively working to fix those situations.
https://docs.gitlab.com/ee/user/project/repository/#repository-size
Known tradeoffs
This proposed option has the following downsides:
- It always includes all objects regardless of whether or not they're reachable. This could result in a higher size being reported (up to 25%), though for most repositories it will be very close.
- It counts refs that are difficult for the user to control, such as refs/merge-requests and refs/keep-around, though these should be near zero if the commits are never rebased.
- If a project is a member of a pool repository, its repository size will include the shared objects that live in the pool repository.
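The pool repository tradeoff follows from how git shares objects: a repository that borrows objects from another object store lists that store in its objects/info/alternates file. The sketch below is a hypothetical helper (the function name and git_dir parameter are assumptions, not part of the proposal) that reports whether a bare repository borrows objects this way, which signals that an all-objects count may include shared objects.

```python
from pathlib import Path
from typing import List


def alternate_object_dirs(git_dir: str) -> List[str]:
    """Return the alternate object directories listed for a bare repository.

    An empty list means the repository does not borrow objects from
    another object store.
    """
    alternates = Path(git_dir) / "objects" / "info" / "alternates"
    if not alternates.is_file():
        return []
    paths = []
    for line in alternates.read_text().splitlines():
        line = line.strip()
        # Skip blank lines and lines starting with '#'.
        if line and not line.startswith("#"):
            paths.append(line)
    return paths
```

A non-empty result for a project suggests that its reported size can be larger than the space occupied by its own objects.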