Use an alternative to Gitlab::ReferenceCounter for tracking when it's safe to migrate repositories

Currently, we use Gitlab::ReferenceCounter in two contexts:

  • Hashed storage migration
  • Repository storage migration

To enable these two (infrequent) activities, we use Redis to track when each git push starts and ends.

This is fine as far as it goes, but the mechanism isn't perfect (the tracking for a long-lived push will expire while the push is still running, for instance) and it carries a relatively large amount of overhead: every push is tracked, however cheap the per-push cost.

Instead, I think we can lean on Gitaly:

  • Pushes to any one repository are handled by a single Gitaly server
    • (This is true right now; distributed gitaly may complicate things a bit)
  • Introduce an RPC to make the repository read-only from Gitaly's point of view. Perhaps it sets a value in the .git/config file that is consulted in the pre-receive hook
  • Introduce an RPC to count how many git-push-related RPCs are ongoing
    • For the "distributed gitaly" case, we'd need a way to accumulate these numbers
  • (Optional) introduce an RPC to forcibly terminate all git-push-related RPCs
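
As a rough illustration of the read-only idea, a pre-receive hook could consult the repository config and reject the push. This is only a sketch under assumptions: the key name `gitlab.readonly` and all helper names are hypothetical, since the proposal only says "a value in the .git/config file".

```ruby
require "open3"

# Hypothetical key name; nothing in the proposal fixes what it would be called.
READ_ONLY_CONFIG_KEY = "gitlab.readonly"

# Interprets the output of `git config --bool --get`, which prints
# "true" or "false" when the key is set and nothing when it is not.
def read_only_value?(raw)
  raw.to_s.strip == "true"
end

# Asks git for the flag. `git config --get` exits non-zero when the key is
# unset, which is treated here as "not read-only".
def repository_read_only?(git_dir)
  out, status = Open3.capture2(
    "git", "--git-dir=#{git_dir}", "config", "--bool", "--get", READ_ONLY_CONFIG_KEY
  )
  status.success? && read_only_value?(out)
end

# A pre-receive hook rejects the whole push by exiting non-zero; anything it
# writes to stderr is relayed to the pushing client.
def pre_receive_exit_code(read_only)
  if read_only
    warn "This repository is temporarily read-only; please retry shortly."
    1
  else
    0
  end
end

# In an actual hook script the last line might be:
#   exit pre_receive_exit_code(repository_read_only?(ENV.fetch("GIT_DIR")))
```

A push that has already passed the pre-receive hook is not affected by the flag, which is why the proposal also needs the count of ongoing pushes.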

We can now manage hashed storage / repository migration in the following way:

  • Send SetRepositoryReadOnly(true)
  • If in a hurry, send CancelOngoingGitPushes()
  • Wait on CountOngoingGitPushes() until it reaches 0
  • Perform the action
  • Send SetRepositoryReadOnly(false)
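
The steps above could be driven from the Rails side roughly as follows. This is a minimal sketch, not an existing API: `FakeGitalyClient` is a hypothetical in-memory stand-in whose three methods only mirror the RPC names proposed above, and the poll interval and timeout are assumptions.

```ruby
# Hypothetical stand-in for a Gitaly client. None of these RPCs exist yet;
# the method names simply follow the ones used in the proposal.
class FakeGitalyClient
  attr_reader :read_only

  def initialize(ongoing_pushes:)
    @ongoing = ongoing_pushes
    @read_only = false
  end

  def set_repository_read_only(value)
    @read_only = value
  end

  def cancel_ongoing_git_pushes
    @ongoing = 0
  end

  # Returns the current count, then simulates one push finishing before the
  # next poll.
  def count_ongoing_git_pushes
    current = @ongoing
    @ongoing -= 1 if @ongoing > 0
    current
  end
end

class PushDrainTimeout < StandardError; end

# Makes the repository read-only, waits for in-flight pushes to drain
# (cancelling them first when urgent), runs the block, and finally makes the
# repository writable again.
def with_pushes_excluded(client, urgent: false, poll_interval: 1.0, timeout: 3600)
  now = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  client.set_repository_read_only(true)
  begin
    client.cancel_ongoing_git_pushes if urgent

    deadline = now.call + timeout
    until client.count_ongoing_git_pushes.zero?
      raise PushDrainTimeout, "pushes still in flight" if now.call > deadline

      sleep(poll_interval)
    end

    yield
  ensure
    client.set_repository_read_only(false)
  end
end

# Usage: the block stands in for the hashed storage / repository migration.
client = FakeGitalyClient.new(ongoing_pushes: 2)
with_pushes_excluded(client, poll_interval: 0) { :migrated }
```

The `ensure` makes the repository writable again even if the action raises or the pushes never drain. Whether a failed migration should instead leave the repository read-only is a design question this sketch does not settle.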

This gets us more reliable push exclusion, since Gitaly should always know whether a push is ongoing, even if it takes more than 30 minutes, a post-update hook fails to run, etc.

It means we no longer have to track git push status in Redis for all repositories, all the time. We do have to track it per Gitaly node, but this is a matter of counting how many ReceivePack and (possibly) InfoRefs RPCs are in flight at the moment of asking.

We don't necessarily have to do any additional accounting for this, but if we do, its load is spread across all Gitaly nodes rather than across the Redis nodes.
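
If additional accounting did turn out to be necessary, it could be as small as a per-repository counter wrapped around each push-related RPC handler. The sketch below models that idea in Ruby purely for illustration; the class and method names are hypothetical, and a real version would live inside Gitaly itself.

```ruby
# Hypothetical model of per-node accounting for in-flight push RPCs.
class InFlightPushCounter
  def initialize
    @mutex = Mutex.new
    @counts = Hash.new(0)
  end

  # Wraps the handling of one push-related RPC. The count is decremented in
  # `ensure`, so a push that fails or is cancelled still releases its slot.
  def track(repository)
    @mutex.synchronize { @counts[repository] += 1 }
    begin
      yield
    ensure
      @mutex.synchronize do
        @counts[repository] -= 1
        @counts.delete(repository) if @counts[repository].zero?
      end
    end
  end

  # What a CountOngoingGitPushes-style RPC would report for one repository.
  def count(repository)
    @mutex.synchronize { @counts.fetch(repository, 0) }
  end
end

counter = InFlightPushCounter.new
counter.track("group/project.git") do
  # The push is handled here; meanwhile count("group/project.git") is 1.
end
```

Because the count lives in the process handling the push, it cannot expire while the push is still running, and it is released as soon as the handler returns or fails.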

It adds the ability to force-terminate ongoing git push sessions if we're in a rush, as might be the case if, e.g., we're evacuating a storage node because of a degraded RAID array or something similar.