Use cruft packs to exclude unreachable objects from repository size calculations
We still cannot iterate on the repository size calculations because of various inefficiencies and because of how those calls are scheduled by Rails. While there is no fix in sight yet for these architectural deficiencies, customers frequently hit the issue that their repository size does not decrease even though they rewrote their repository. In fact, in many cases rewriting a repository to drop certain blobs at first leads to a size increase rather than a size decrease.
The root cause is that we have a grace period of two weeks during which we retain unreachable objects in order to avoid races in Git that may otherwise lead to data loss and repository corruption. We cannot remove this grace period, so the unreachable and now loose objects continue to exist in the repository and are counted towards the repository size. It is not reasonable to hold customers accountable for these objects: that we need to retain them for an extended period of time is an internal implementation detail.
In %15.10, though, we are about to introduce cruft packs. Instead of exploding unreachable objects into loose objects, Git packs them into a packfile that is marked with a .mtimes file. This .mtimes file records a per-object timestamp so that Git does not have to freshen the whole packfile when a single one of these objects becomes referenced again. This makes it more efficient to store unreachable objects during the grace period, as we can continue to make use of compression.
Another benefit is that we can improve our naive repository size calculation, which simply takes the size of the whole repository. Instead of returning the complete repository's size, we will subtract the size of the cruft packs, which contain only unreachable objects. This should be a much better estimate of the size of the objects that are actually reachable, and it is thus likely to fix many of the complaints we have seen: a customer would no longer have to wait two weeks until the repository size reduction is reflected.
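To illustrate the idea, here is a minimal sketch of the adjusted calculation. It is not the actual implementation: the function names are hypothetical, and it assumes that a cruft pack can be recognized by a pack-*.mtimes file in objects/pack and that its sibling .pack, .idx and .rev files should be excluded as well.

```python
import os
from pathlib import Path

# Assumption: these are the files that make up a single cruft pack.
CRUFT_SUFFIXES = (".pack", ".idx", ".rev", ".mtimes")


def cruft_pack_files(pack_dir: Path) -> list[Path]:
    """Return every file that belongs to a cruft pack in pack_dir."""
    if not pack_dir.is_dir():
        return []
    files = []
    # A pack counts as a cruft pack if it has an accompanying .mtimes file.
    for mtimes in pack_dir.glob("pack-*.mtimes"):
        for suffix in CRUFT_SUFFIXES:
            candidate = mtimes.with_suffix(suffix)
            if candidate.is_file():
                files.append(candidate)
    return files


def directory_size(root: Path) -> int:
    """Naive size: sum of the sizes of all files below root, in bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # lstat() so that a dangling symlink does not raise.
            total += (Path(dirpath) / name).lstat().st_size
    return total


def estimated_reachable_size(repo: Path) -> int:
    """Naive repository size minus the size of all cruft packs."""
    cruft = sum(p.stat().st_size for p in cruft_pack_files(repo / "objects" / "pack"))
    return directory_size(repo) - cruft
```

Here repo is expected to point at the Git directory itself, that is a bare repository or the .git directory. The result remains an estimate: unreachable objects that are still loose or that live in a regular packfile are counted until the next repack moves them into a cruft pack.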
Note: I do not consider this to be a good solution to the overall problem, but rather an intermediate step until we have finally fixed the architecture to properly report repository sizes.