Change repository size calculations to provide mechanism instead of policy
In the past, we've had a lot of discussion revolving around repository size calculations. Part of the problem is that the actual size of a repository is not clearly defined and depends on which definition you pick. The definitions we have used so far:
- Complete repository's size, taking into account all data structures.
- Only the object database.
- Only the object database, but excluding cruft packs as they are about to be pruned anyway.
- Only reachable objects calculated via git-rev-list(1).
- Only reachable objects calculated via git-rev-list(1), but taking into account object pools.
- Only reachable objects calculated via git-rev-list(1), but excluding internal references like `refs/keep-around/`.
Many of these definitions make sense in some contexts, but not in others. So no matter what definition we arrive at, it's never going to be perfect. Furthermore, requirements have frequently been changing.
The realization I had today is that we're encoding policy into the `RepositorySize()` RPC, and that does not make a lot of sense. Why should Gitaly decide what the repository size is when the client knows the context better? When deciding whether to move a repository to a different Gitaly node, it might be preferable to use the complete repository size as the metric. On the other hand, when we want to determine the usage quota, we only want to take into account what the user can actually control and thus only report the size of the object database without cruft packs.
We should thus consider changing our approach to this problem. Instead of a `RepositorySize()` RPC call that encodes policy, we should provide two separate RPC calls that empower the caller to decide the actual policy:
- One RPC call to provide detailed repository information. This gives the caller the ability to pick different sizes depending on the use case at hand. Ideally, we'd provide both "summary values" that summarize a specific class of data (all objects together, all objects waiting to be deleted) and "detail values" (the number of packfiles, the total size of packfiles, etc.).
- One RPC call to calculate the size of all objects reachable from a set of references via git-rev-list(1). The caller is responsible for excluding internal references like `refs/keep-around/` as dictated by their policy.
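To make the second RPC a bit more concrete, here is a minimal sketch of how a caller-supplied exclusion policy could be turned into a git-rev-list(1) invocation. The function name `reachableSizeArgs` and the idea of passing glob patterns are assumptions for illustration only; whether the eventual RPC takes patterns, explicit revisions, or something else is not decided here. The `--objects`, `--disk-usage`, `--exclude` and `--all` options are existing git-rev-list(1) options (`--disk-usage` requires Git 2.31 or newer).

```go
package main

// reachableSizeArgs is a hypothetical helper, not an existing Gitaly function.
// It builds the arguments for a `git rev-list` call that reports the on-disk
// size of all objects reachable from every reference except those matching
// one of the caller-provided glob patterns.
func reachableSizeArgs(excludePatterns []string) []string {
	args := []string{"rev-list", "--objects", "--disk-usage"}
	for _, pattern := range excludePatterns {
		// --exclude only affects the next --all/--glob/--branches/... option,
		// so every pattern has to come before the final --all.
		args = append(args, "--exclude="+pattern)
	}
	return append(args, "--all")
}
```

A caller that does not want keep-around references to count would pass `[]string{"refs/keep-around/*"}`, while a caller with a different policy would pass different patterns or none at all. Gitaly itself would not need to know why.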
This removes all policy from Gitaly and gives callers a better way to iterate. Furthermore, we can use the repository-information RPC call to provide a fine-grained dashboard that gives better visibility into repositories. The nice thing is that Gitaly already derives all of these stats in the context of housekeeping anyway.
One risk to be aware of is that clients might start to rely on the actual on-disk state of repositories and how Gitaly optimizes them. That is why I want to distinguish between "summary values", which describe a category of objects, and "detail values". The former class of values should be generally valid even if implementation details change, and should thus be safe for clients to depend on. The latter class of values cannot ever be guaranteed to be stable.
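As a rough sketch of what this split could look like from the caller's side, the following Go types separate summary values from detail values and show two client-side policies built on top of them. All type, field and function names are hypothetical and only meant to illustrate the idea. They are not an existing Gitaly API.

```go
package main

// ObjectsSummary holds hypothetical "summary values": categories of data that
// should stay meaningful even if Gitaly changes how it stores objects.
type ObjectsSummary struct {
	SizeBytes      uint64 // all objects together
	StaleSizeBytes uint64 // objects waiting to be deleted, e.g. cruft packs
}

// PackfileDetails holds hypothetical "detail values": implementation details
// that are useful for dashboards but come with no stability guarantee.
type PackfileDetails struct {
	Count     uint64 // number of packfiles
	SizeBytes uint64 // total size of packfiles
}

// RepositoryInfo is a hypothetical response of the repository-information RPC.
type RepositoryInfo struct {
	TotalSizeBytes uint64 // complete repository, including all data structures
	Objects        ObjectsSummary
	Packfiles      PackfileDetails
}

// quotaSize is a client-side policy: only count what the user can control,
// that is, the object database without objects that are about to be pruned.
func quotaSize(info RepositoryInfo) uint64 {
	if info.Objects.StaleSizeBytes > info.Objects.SizeBytes {
		return 0 // guard against underflow on inconsistent input
	}
	return info.Objects.SizeBytes - info.Objects.StaleSizeBytes
}

// rebalanceSize is another client-side policy: when moving a repository to a
// different node, the complete on-disk size is what matters.
func rebalanceSize(info RepositoryInfo) uint64 {
	return info.TotalSizeBytes
}
```

Both policies only read summary values, so they would keep working if the packfile layout changed. A dashboard could additionally display `Packfiles`, with the understanding that those fields may change or disappear.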