Consider Repository size improvements
Summary
@pks-gitlab recently gave a presentation on Gitaly and Repository sizes (slides, recording) and how they're calculated.
We have three RPCs available with different pros/cons (see slide 19), but in short:
- RepositorySize - old and not to be used
- RepositoryInfo - better than RepositorySize, but still has downsides such as occasionally doubling size during maintenance
- ObjectsSize - includes object pools and is far more accurate, but very slow to compute and should be used sparingly
Problem
We currently (as of %16.3) use RepositoryInfo#recent_size for a Project's repository size and for display to a customer (i.e. stored in project_statistics.repository_size).
This size covers recent objects only: it excludes stale/unreachable objects and it does not include object pools. By itself it is not all that useful to us (for billing) or to our customers, because it does not accurately represent the actual storage size of a repository.
We switched to using this because it was unfair to our Customers to enforce storage limits for consumption of repository storage that they have no control over.
But for GitLab, this means that we're likely under-charging for storage, because the repository size we now use is much smaller than what is actually being consumed.
Proposal
With the variety of RPCs made available by the Gitaly team, we can make some improvements for both GitLab and our Customers, primarily:
- using all of the information available from RepositoryInfo and making it available to the customer; this will give them a much better overview of their storage usage, broken down by type (recent objects, stale objects, total, etc.)
  - store the entire set of data in a single column (JSONB) or multiple, on the project statistics table
  - refresh the data as frequently as it currently is
  - make the data available in the UI and APIs
- Switch to using the ObjectsSize RPC for consumption/billing purposes
  - perform an ObjectsSize repository size check at most once per day and store it in the DB (project_statistics.repository_size or another new column if we'd rather implement this in parallel with our current implementation)
  - ensure "Recalculate repository usage" feature uses the ObjectsSize repository size to update 👆🏽
  - update enforcement/limit rules to use this size instead
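To make the two parts of the proposal concrete, here is a minimal sketch in Python (chosen only for illustration; it is not GitLab or Gitaly code). It assumes a hypothetical `ProjectStatistics` record, a hypothetical `repository_info_json` field standing in for the JSONB column, and a caller-supplied `fetch_objects_size` callable standing in for the ObjectsSize RPC. The breakdown keys used in the usage note are placeholders, not confirmed RPC field names.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional

# Assumption: "at most once per day" is modelled as a 24-hour minimum interval.
OBJECTS_SIZE_REFRESH_INTERVAL = timedelta(days=1)


@dataclass
class ProjectStatistics:
    """Illustrative stand-in for a project_statistics row (not the real schema)."""

    # Full RepositoryInfo breakdown, serialized as JSON (stands in for JSONB).
    repository_info_json: str = "{}"
    # Size used for consumption/billing, sourced from ObjectsSize.
    repository_size: int = 0
    # When repository_size was last refreshed from ObjectsSize.
    objects_size_checked_at: Optional[datetime] = None


def store_repository_info(stats: ProjectStatistics, info: dict) -> None:
    """Store the whole breakdown in one column instead of a single number."""
    stats.repository_info_json = json.dumps(info, sort_keys=True)


def refresh_billable_size(
    stats: ProjectStatistics,
    fetch_objects_size: Callable[[], int],
    now: datetime,
    force: bool = False,
) -> bool:
    """Refresh the billable size unless it was refreshed within the interval.

    `force=True` models a manual recalculation that bypasses the throttle.
    Returns True if the expensive size check was performed.
    """
    last = stats.objects_size_checked_at
    if not force and last is not None and now - last < OBJECTS_SIZE_REFRESH_INTERVAL:
        return False
    stats.repository_size = fetch_objects_size()
    stats.objects_size_checked_at = now
    return True
```

In this sketch, the regular statistics refresh would call `store_repository_info(stats, {"recent_size": ..., "stale_size": ..., "size": ...})` as often as it does today, while `refresh_billable_size` is skipped when the last check is less than a day old. A "Recalculate repository usage" action would pass `force=True`.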
Outcome
- Customers will have a more detailed view into their repository storage consumption and hopefully have less confusion. In turn, this will hopefully improve confidence in the product/statistics and reduce support requests.
- We (GitLab) will have a more accurate statistic to check against when determining a namespace's repository storage consumption. This should lead to an increase in purchases of additional storage or plan upgrades, and to a reduction in cost as customers reduce storage consumption to conform to limits.