Add repacking support to transaction manager

For Repack objects within a transaction (#5584 - closed)

Gitaly has a sophisticated housekeeping system. That system packs loose objects, reorganizes the on-disk layout of packfiles, prunes unreachable objects, and so on. It keeps repositories performant and cost-effective to store, which makes it a crucial component of Gitaly.

The current housekeeping approach works on the repository concurrently with the TransactionManager. This is unsafe because the TransactionManager is expected to be the single writer in the repository. We thus need a different method for repacking a repository's objects, one that is synchronized with all other access.

The WAL manager handles concurrent requests very differently, so the repacking task must adapt to the new architecture. The manager handles a repacking task in three stages: preparation, verification, and application.

When a transaction is committed, the transaction's goroutine runs the repacking preparation. This stage runs the git-repack(1) command with parameters that depend on the desired strategy. Afterward, it attaches the list of new files and the list of deleted files to the transaction. This stage can take minutes or even hours. While it runs, the manager can still accept other update transactions.
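As an illustration of the bookkeeping at the end of the preparation stage, the following minimal sketch derives the two file lists by comparing directory listings taken before and after the repack. The `repackResult` type and `diffPackFiles` function are hypothetical names for this example, and comparing listings is an assumption about how the lists could be produced, not a description of what the manager does.

```go
package main

import "sort"

// repackResult is a hypothetical record of what a repack changed on disk.
type repackResult struct {
	NewFiles     []string // files that exist only after the repack
	DeletedFiles []string // files that existed only before the repack
}

// diffPackFiles compares the file names seen before and after a repack and
// returns the names that appeared and the names that disappeared, sorted.
func diffPackFiles(before, after []string) repackResult {
	beforeSet := make(map[string]struct{}, len(before))
	for _, name := range before {
		beforeSet[name] = struct{}{}
	}
	afterSet := make(map[string]struct{}, len(after))
	for _, name := range after {
		afterSet[name] = struct{}{}
	}

	var result repackResult
	for name := range afterSet {
		if _, ok := beforeSet[name]; !ok {
			result.NewFiles = append(result.NewFiles, name)
		}
	}
	for name := range beforeSet {
		if _, ok := afterSet[name]; !ok {
			result.DeletedFiles = append(result.DeletedFiles, name)
		}
	}

	// Map iteration order is random, so sort for a stable result.
	sort.Strings(result.NewFiles)
	sort.Strings(result.DeletedFiles)
	return result
}
```

Files present in both listings are left out of the result, so packfiles that the repack did not touch are neither re-linked nor removed later.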

When the preparation stage finishes, the repacking transaction is submitted to the manager and the verification is performed. This verification is head-of-line blocking for each repository. The manager verifies if the repacking task causes any conflict with other transactions accepted beforehand. There are two types of conflicts:

  • Another transaction points new references to pruned objects.
  • Another transaction includes a change that depends on pruned objects.

Both cases require examining the list of transactions committed since the repacking task started. The manager collects their reference tips and verifies that each one is still accessible in the repository or in a new packfile produced by another transaction. The dependency check is not supported yet.

In the future, when Git supports extracting the objects relevant to a pruned object, we can resolve conflicts more intelligently. At present, the manager rejects the repacking task if it finds any conflict. If the task is good to go, the manager appends its entry to the WAL.

Finally, the corresponding log entry is applied. The manager removes redundant packfiles and links new ones. If there are any concurrent transactions that introduce file changes, their resulting packfiles are located next to the repacked one(s).

At this stage, we don't want to modify the housekeeping scheduler. The scheduler decides when and how a housekeeping task should run on a repository. It chooses between different repacking strategies depending on the state of the repository, and the manager handles each strategy accordingly. There are currently four:

  • IncrementalWithUnreachable: this strategy packs unreachable objects into a single packfile. In the WAL transaction, all changes are packed by default. So, this strategy is a no-op.
  • Geometric: this strategy rearranges the list of packfiles according to a geometric progression without taking reachability into account. It doesn't prune objects either.
  • FullWithUnreachable: this strategy merges all packfiles into a single packfile, simultaneously removing any loose objects. Unreachable objects are then appended to the end of this unified packfile.
  • FullWithCruft: in traditional housekeeping, the manager gets rid of unreachable objects via a full repack with cruft. It moves all unreachable objects into a cruft packfile and keeps track of each object's mtime. Unreachable objects older than a grace period are cleaned up. The grace period ensures that housekeeping doesn't accidentally delete an object that is about to become reachable. In WAL, it's feasible to examine the list of applied transactions instead. As a result, we don't need to take object expiry or cruft packs into account. This operation triggers a normal full repack without cruft packing. We keep the same strategy name for backward compatibility.

Those strategies have increasing costs and correspondingly larger effects. The lower-cost ones are triggered more frequently. Only the last strategy involves object pruning. The others are safe to run concurrently with other transactions.
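The way the manager treats each strategy can be summarized in a small lookup. The sketch below only encodes the properties stated in the list above: which strategy is a no-op and which one prunes. The type, constant, and function names are hypothetical and are not taken from Gitaly's code.

```go
package main

import "fmt"

// repackStrategy is a hypothetical enumeration of the scheduler's strategies.
type repackStrategy string

const (
	incrementalWithUnreachable repackStrategy = "IncrementalWithUnreachable"
	geometric                  repackStrategy = "Geometric"
	fullWithUnreachable        repackStrategy = "FullWithUnreachable"
	fullWithCruft              repackStrategy = "FullWithCruft"
)

// repackPlan describes how a strategy is handled inside a WAL transaction.
type repackPlan struct {
	NoOp          bool // nothing to do because changes are already packed
	PrunesObjects bool // may drop objects, so conflicts must be verified
}

// planFor maps a scheduler strategy to its handling in the manager.
func planFor(strategy repackStrategy) (repackPlan, error) {
	switch strategy {
	case incrementalWithUnreachable:
		// All changes in a WAL transaction are packed by default.
		return repackPlan{NoOp: true}, nil
	case geometric, fullWithUnreachable:
		// Packfiles are rearranged or merged, but no object is pruned.
		return repackPlan{}, nil
	case fullWithCruft:
		// Runs as a normal full repack; the only strategy that prunes.
		return repackPlan{PrunesObjects: true}, nil
	default:
		return repackPlan{}, fmt.Errorf("unknown repacking strategy %q", strategy)
	}
}
```

In this model, a caller would run the reference-tip verification only when `PrunesObjects` is true and skip the repack entirely when `NoOp` is true.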