Ensure UI never calls Gitaly's GetRepositorySize directly
Completed Effort
In gitaly!4430 (merged), the Gitaly team changed the way GetRepositorySize calculates the disk usage of a repository. Previously, the operation ran a simple du -sk on the .git directory. While this was efficient and technically gave us the physical disk usage of each repository, it was not accurate from a user's perspective: how the repository is stored (for example, repacked or not repacked) can cause the size to vary greatly (see the discussion in https://gitlab.com/gitlab-org/gitlab/-/issues/351415).
For this reason, GetRepositorySize now uses (behind a feature flag) the Git operation git rev-list --all --objects --disk-usage to get the disk usage of all objects in the Git object database. This gives a much more consistent number that more accurately represents the usage of the repository from a user's point of view.
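To make the difference between the two measurements concrete, here is a minimal Ruby sketch of how each command could be built and its output normalized to bytes. The helper names (du_command, rev_list_command, parse_du_output, parse_rev_list_output, run) are hypothetical and are not taken from Gitaly or Rails. The sketch relies on du -sk reporting kilobytes and git rev-list --disk-usage reporting bytes.

```ruby
require "open3"

# Previous approach: physical size of the .git directory, in kilobytes.
def du_command(git_dir)
  ["du", "-sk", git_dir]
end

# New approach: on-disk size of all objects reachable from any ref, in bytes.
def rev_list_command(git_dir)
  ["git", "--git-dir=#{git_dir}", "rev-list", "--all", "--objects", "--disk-usage"]
end

# du -sk prints "<kilobytes>\t<path>"; convert the first field to bytes.
def parse_du_output(output)
  output.split.first.to_i * 1024
end

# git rev-list --disk-usage prints a single number of bytes.
def parse_rev_list_output(output)
  output.strip.to_i
end

# Runs a command and returns its stdout, raising if it exits non-zero.
def run(command)
  output, status = Open3.capture2(*command)
  raise "command failed: #{command.join(' ')}" unless status.success?

  output
end
```

For example, parse_rev_list_output(run(rev_list_command("/path/to/repo.git"))) would return the object database usage in bytes, assuming git is installed and the (hypothetical) path points at a repository.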
Remaining Effort
With this update in place, the remaining issue is that this method is much slower than du -sk, since it does a full graph walk through the entire object database. On large repositories, this can take several seconds or longer.
We don't want this to impact project page loads. In Rails, the result of the call to GetRepositorySize is already cached and stored in the project_statistics table.
With this new slower but more accurate method of calculating sizes, we need to make sure that the UI never calls GetRepositorySize directly, but instead always gets its value from the project_statistics table. In addition, we want to minimize the number of times GetRepositorySize is called: it should only be called after housekeeping runs for a repository, rather than on every push.
To update the size after each push, instead of calling GetRepositorySize, we can do this incrementally in Gitlab::GitAccess, since we already check the size of the quarantine directory there to see whether the push will exceed the size limit. We can update the statistics table when we do this check.
Then, each time housekeeping runs on a repository, we can call ProjectCacheWorker to update the repository size. This way, we limit the expensive GetRepositorySize call to when we do housekeeping.
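The three paths described above (UI read, push, housekeeping) can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not the Rails implementation: ProjectStatistics, FakeGitalyClient, and RepositorySizeTracker are hypothetical stand-ins for the project_statistics row, the Gitaly client, and the logic that would live in Gitlab::GitAccess and ProjectCacheWorker.

```ruby
# Hypothetical stand-in for a row in the project_statistics table.
ProjectStatistics = Struct.new(:repository_size)

# Hypothetical stand-in for the Gitaly client. It counts calls so we can
# check how often the expensive GetRepositorySize RPC would be made.
class FakeGitalyClient
  attr_reader :calls

  def initialize(size)
    @size = size
    @calls = 0
  end

  def repository_size
    @calls += 1
    @size
  end
end

class RepositorySizeTracker
  def initialize(statistics, gitaly)
    @statistics = statistics
    @gitaly = gitaly
  end

  # UI read path: only ever reads the cached value.
  def cached_size
    @statistics.repository_size
  end

  # Push path: add the quarantine directory size that was already
  # measured for the size-limit check, without calling Gitaly.
  def record_push(quarantine_size)
    @statistics.repository_size += quarantine_size
  end

  # Housekeeping path: the only place the expensive call is made.
  def refresh_after_housekeeping
    @statistics.repository_size = @gitaly.repository_size
  end
end
```

In this sketch, a push only adjusts the cached number, so the stored value may drift from the real usage until the next housekeeping run replaces it with the figure reported by Gitaly.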
To provide clarity for the rest of the effort associated with this issue, the following discrete updates are needed.
- Call ProjectCacheWorker to update the statistics table after housekeeping is run.
- Ensure that Rails does not call GetRepositorySize directly, but instead reads the repository size from the project_statistics table (which will have the cached value).
- Do not include internal refs (for example, refs/merge-requests, refs/keep-around, etc.) in the repository size calculation.
- Exclude RepositorySize from the Gitaly Apdex calculation, as the latency will go way up due to git-rev-list(1).
- Add an endpoint for the UI to call for on-demand repository size recalculation.
- Add a button to recalculate repository size.
- Because of the performance concerns, remove the feature flag around the updated gRPC only once we know it's not being called except during a ProjectCacheWorker call.
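For the item about internal refs, one possible approach (an assumption, not necessarily how Gitaly implements it) is to pass git rev-list an --exclude pattern for each internal ref namespace before --all. The constant INTERNAL_REF_PREFIXES and the helper disk_usage_args below are hypothetical names, and the list only contains the two namespaces named in this issue.

```ruby
# Internal ref namespaces named in this issue; a real list may be longer.
INTERNAL_REF_PREFIXES = ["refs/merge-requests", "refs/keep-around"].freeze

# Builds git rev-list arguments that skip the given ref namespaces.
# --exclude applies to the next --all, so it must come before it.
def disk_usage_args(excluded_prefixes = INTERNAL_REF_PREFIXES)
  excludes = excluded_prefixes.map { |prefix| "--exclude=#{prefix}/*" }
  ["rev-list", *excludes, "--all", "--objects", "--disk-usage"]
end
```

Objects that are reachable only from the excluded refs would then no longer count towards the reported size.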
Button in Usage Quotas
Introduce a "Recalculate repository usage" button on the Settings > Usage Quotas page.
- Button placed on Usage Quotas page
- Clicking on the button will trigger the function GetRepositorySize
- Feedback to the user will be in the form of an informational alert (blue)
- Alert title: Repository size recalculation started
- Alert content: Refresh page in a few minutes to view usage.
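The button behaviour listed above could be sketched as an endpoint that schedules the recalculation in the background and immediately returns the alert text. Alert, RecalculationEndpoint, and the in-memory queue are hypothetical; in Rails the queue would presumably be a background worker. Skipping a project that is already queued is an added assumption, made to keep the expensive call from being triggered repeatedly.

```ruby
# Hypothetical value object for the alert shown to the user.
Alert = Struct.new(:variant, :title, :content)

class RecalculationEndpoint
  # queue: any Array-like object holding project IDs awaiting recalculation.
  def initialize(queue)
    @queue = queue
  end

  # Schedules a recalculation and returns the informational alert.
  # The size itself is computed later, outside the request.
  def call(project_id)
    # Assumption: avoid queueing the same project twice.
    @queue << project_id unless @queue.include?(project_id)

    Alert.new(
      :info,
      "Repository size recalculation started",
      "Refresh page in a few minutes to view usage."
    )
  end
end
```

Because the request returns before the size is recalculated, the page keeps showing the cached value until the user refreshes, which matches the alert content.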
Usage Quotas with button | Usage Quotas with alert |
---|---|