git/housekeeping: Introduce geometric repacking strategy
One of the most important tasks performed by our repository housekeeping is repacking objects into packfiles such that they can be efficiently accessed. Unfortunately, this is also the most expensive part in general as it is both CPU- and IOPS-intensive. So even though it would be most efficient to always pack all objects into a single packfile, this is not feasible especially in the context of large repositories.
We are thus using a hybrid approach right now where we alternate between incremmental repacks that soak up all loose objects, and full repacks that repack all packfiles into a single packfile once there are too many of them. In highly active repositories this strategy is too inefficient though as we accumulate incremental packfiles fast and then need to do the full repack quite regularly. And just bumping the limits of how many packfiles are allowed until the next full repack is not ideal either, as the repository will become less efficient to serve the more packfiles there are.
To improve upon this usecase, Git v2.33.0 has introduced support for geometric repacking that is a tradeoff between incremental and full repacks: instead of repacking either only loose objects or everything at once, a geometric repack will arrange packfiles so that the end result forms a geometric sequence regarding the amount of objects contained in each of the packs. Like this, a geometric repack would typically only rewrite a small slice of packfiles.
In the best case, we'd shift to use geometric repacks exclusively in order to reduce the overhead of repository maintenance. But there are two reasons why we cannot do so:
- Geometric repacks don't take reachability into account and will
always include all loose objects. As we need to expire objects
over time it means that we have to do a full repack with cruft
packs every now and then to evict unreachable objects.
- Geometric repacks don't take delta islands into account. We use
delta islands to make sure that deltas are only created against
objects which are part of the default refspec used by clients,
namely branches and tags. We thus need to regularly do a full
repack that "freshens" our delta islands.
We're thus forced to keep alternating between full and geometric repacks. But we can do the full repacks a whole lot less frequently than we used to do them before. As we don't have control over the number of packfiles anymore (it's controlled by the geometric sequence now), we instead need to use a different heuristic to decide whether to do one or the other. For now, we are going with time-based repacks where we only do a full repack in case a certain amount of time has passed. This effectively puts a rate-limit in place.
Implement this new strategy behind a feature flag.
Closes #4998 (closed).
Merge request reports
Activity
changed milestone to %15.11
assigned to @pks-gitlab
1 Warning This merge request is quite big (704 lines changed), please consider splitting it into multiple merge requests. Reviewer roulette
Changes that require review have been detected! A merge request is normally reviewed by both a reviewer and a maintainer in its primary category and by a maintainer in all other categories.
To spread load more evenly across eligible reviewers, Danger has picked a candidate for each review slot. Feel free to override these selections if you think someone else would be better-suited or use the GitLab Review Workload Dashboard to find other available reviewers.
To read more on how to use the reviewer roulette, please take a look at the Engineering workflow and code review guidelines. Please consider assigning a reviewer or maintainer who is a domain expert in the area of the merge request.
Once you've decided who will review this merge request, mention them as you normally would! Danger does not automatically notify them for you.
Category Reviewer Maintainer ~ Justin Tobler (
@justintobler
) (UTC-5, 7 hours behind@pks-gitlab
)Pavlo Strokov (
@8bitlife
) (UTC+3, 1 hour ahead of@pks-gitlab
)If needed, you can retry the
danger-review
job that generated this comment.Generated by
Dangerrequested review from @knayakgl and @qmnguyen0711
removed review request for @knayakgl
added 30 commits
-
083c9dfd...b8709da8 - 26 commits from branch
master
- 40c0f382 - git/stats: Convert multi-pack-index packfile count to uint64
- 46a31c5b - git/housekeeping: Add strategy to repack with unreachable objetcs
- 9d13e347 - git/housekeeping: Deduplicate common repack configuration
- 12bde1a9 - git/housekeeping: Introduce geometric repacking strategy
Toggle commit list-
083c9dfd...b8709da8 - 26 commits from branch
62 63 WriteMultiPackIndex: true, 63 64 } 64 65 66 if featureflag.GeometricRepacking.IsEnabled(ctx) { 67 nonCruftPackfilesCount := s.info.Packfiles.Count - s.info.Packfiles.CruftCount 68 69 // It is mandatory for us that we perform regular full repacks in repositories so 70 // that we can evict objects which are unreachable into a separate cruft pack. So in 71 // the case where we have more than one non-cruft packfiles and the time since our 72 // last full repack is longer than the grace period we'll perform a full repack. 73 // 74 // This heuristic is simple on purpose: customers care about when objects will be 75 // declared as unreachable and when the pruning grace period starts as it impacts 76 // usage quotas. So with this simple policy we can tell customers that we evict and 77 // expire unreachable objects once per day. Thanks, @pks-gitlab. The MR looks neat and elegant. It adds a lot of contexts and reasons behind the heuristics of picking strategy. I have one nit, but it doesn't matter anyway. I'll approve and set MWPS.
Although I involved in the reviews of housekeeping for a while, I feel like missing the big picture. I wonder if you can spend sometimes summarizing the direction of the housekeeping in
doc/
, especially the reasons why we move toward the current direction. Something like this MR's description is super helpful.Although I involved in the reviews of housekeeping for a while, I feel like missing the big picture. I wonder if you can spend sometimes summarizing the direction of the housekeeping in
doc/
, especially the reasons why we move toward the current direction. Something like this MR's description is super helpful.@qmnguyen0711 I'm kind of torn on this, to be honest. I have recently written the blog post about repository maintenance which explains what the issues were with the old strategy. But it's rather a snapshot in time, and things have been evolving quite a lot in that context. And furthermore, we have the housekeeping documentation that is customer-facing which doesn't go into the nitty-gritty details, because ultimately customers should not need to care.
What is still missing is an overview that details how exactly the current strategy looks like and why it looks that way. But the issue I have with this is that it would grow outdated quite fast if one forgot to update it. So instead, I have been trying to be as verbose as possible with the in-code comments to explain all the different choices I have been doing there. It's easier to update that documentation while at it.
I'm not quite sure what exactly it is that you're searching for. Is it just an overall picture of the complete housekeeping strategy that we're using by now? If so, I could definitely try to implement something like that, but I'd rather want to do it as in-code documentation.
What is still missing is an overview that details how exactly the current strategy looks like and why it looks that way. But the issue I have with this is that it would grow outdated quite fast if one forgot to update it. So instead, I have been trying to be as verbose as possible with the in-code comments to explain all the different choices I have been doing there. It's easier to update that documentation while at it.
Yeah, I agree. I guess it's a normal mental state where an area moves fast and changes fast. I really appreciate your effort when you spent a lot of time on in-code documentations, as well as commit messages. Combining commits and codes, I think I can get the full vision and reasons behind each change. Sorry for bragging about this
@qmnguyen0711 There really is no need to be sorry -- it is a valid concern if it is hard to understand the bigger picture, and I would love to improve the situation if required.
As the driving force behind this kind of work that spans over many releases one tends to have a clear picture of what one is aiming for, up to the point where things are "obvious". "Obvious" of course typically only means obvious to oneself as compared to obvious for other involved parties, and this is something that is easy to miss.
Also, even if I wrote a bunch of documentation, it does not mean that other involved parties actually know that this documentation even exists in the first place. They weren't part of all the steps, so it's on me if I didn't clearly communicate where to find them.
So again, no need to be apologize in any way. Let me know if you think it would help to have documentation of the overall strategy and the motivations in our code and I'll happily provide it.
mentioned in commit 92f31e24
mentioned in merge request !5510 (merged)
added workflowstaging-canary label
added workflowcanary label and removed workflowstaging-canary label
added workflowstaging label and removed workflowcanary label
added workflowproduction label and removed workflowstaging label
mentioned in merge request !5607 (merged)
added releasedcandidate label
added releasedpublished label and removed releasedcandidate label