Use an alternative to `Gitlab::ReferenceCounter` for tracking when it's safe to migrate repositories
Currently, we use `Gitlab::ReferenceCounter` in two contexts:
- Hashed storage migration
- Repository storage migration
To enable these two (infrequent) activities, we use Redis to track when each git push starts and when it ends.
This is fine as far as it goes, but the mechanism isn't perfect - a long-lived push will outlive its Redis expiry, for instance - and it carries a relatively large amount of overhead, since every push must be tracked, however efficiently.
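The expiry problem can be seen in a small model of the current scheme. This is a hedged sketch, not the real `Gitlab::ReferenceCounter`: the class name, the TTL handling, and the injectable clock are all stand-ins for Redis's key-with-TTL behaviour.

```python
# Hypothetical model of a per-repository reference counter with an expiry,
# as kept in Redis today. The injectable `clock` stands in for wall time.

class ExpiringReferenceCounter:
    def __init__(self, ttl_seconds, clock):
        self.ttl = ttl_seconds
        self.clock = clock
        self.count = 0
        self.expires_at = 0.0

    def push_started(self):
        self.count += 1
        self.expires_at = self.clock() + self.ttl  # TTL reset on every write

    def push_finished(self):
        self.count = max(0, self.count - 1)
        self.expires_at = self.clock() + self.ttl

    def value(self):
        # Like an expired Redis key, the counter reads as absent (zero)
        # once the TTL has elapsed - even if a push is still in flight.
        if self.clock() >= self.expires_at:
            return 0
        return self.count


now = [0.0]
counter = ExpiringReferenceCounter(ttl_seconds=1800, clock=lambda: now[0])
counter.push_started()
print(counter.value())  # push is tracked while the key is live
now[0] += 3600          # a push that outlives the 30-minute TTL
print(counter.value())  # the counter has silently expired to zero
```

This is exactly the failure mode mentioned above: a push longer than the TTL makes the repository look idle even though a write is still in progress.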
Instead, I think we can lean on Gitaly:
- Pushes to any one repository are handled by a single Gitaly server
  - (This is true right now; distributed Gitaly may complicate things a bit)
- Introduce an RPC to make the repository read-only from Gitaly's point of view. Perhaps it sets a value in the `.git/config` file that is consulted in the `pre-receive` hook
- Introduce an RPC to count how many git-push-related RPCs are ongoing
- For the "distributed Gitaly" case, we'd need a way to accumulate these numbers
- (Optional) introduce an RPC to forcibly terminate all git-push-related RPCs
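The read-only RPC could work roughly as follows. This is a minimal sketch under the assumptions stated above: the name `SetRepositoryReadOnly`, the `gitlab.readonly` config key, and the hook's return shape are all hypothetical, and a dict stands in for `.git/config`.

```python
# Hypothetical read-only flag in .git/config, consulted by the
# pre-receive hook. Names are illustrative, not Gitaly's real API.

class Repository:
    def __init__(self):
        self.config = {}  # stands in for .git/config

def set_repository_read_only(repo, read_only):
    # Proposed RPC body: flip a config value that hooks can consult.
    repo.config["gitlab.readonly"] = "true" if read_only else "false"

def pre_receive(repo):
    # The pre-receive hook rejects the push while the flag is set.
    if repo.config.get("gitlab.readonly") == "true":
        return (False, "repository is temporarily read-only for migration")
    return (True, "")

repo = Repository()
set_repository_read_only(repo, True)
print(pre_receive(repo))  # push rejected while the flag is set
set_repository_read_only(repo, False)
print(pre_receive(repo))  # push accepted again
```

Keeping the flag in the repository itself means any process that runs the hooks sees it, without a round-trip to an external store.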
We can now manage hashed storage / repository migration in the following way:
- Send `SetRepositoryReadOnly(true)`
- If in a hurry, send `CancelOngoingGitPushes()`
- Wait on `CountOngoingGitPushes()` until it reaches 0
- Perform the action
- Send `SetRepositoryReadOnly(false)`
This gets us more reliable push exclusion (since Gitaly should always know if a push is ongoing, even if it takes more than 30 minutes, or a post-update hook fails to run, etc.).
It means we no longer have to track `git push` status in Redis for all repositories, all the time. We do have to track it per-Gitaly, but this is a matter of counting how many `ReceivePack` and (possibly) `InfoRefs` RPCs are in flight at the moment of asking.
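Per-node counting is cheap because it is just an in-process counter wrapped around the push-related handlers. A sketch, assuming a context-manager wrapper around each RPC; the class name and the set of tracked RPC names are illustrative.

```python
import threading
from contextlib import contextmanager

class InFlightCounter:
    # In-process tally of push-related RPCs on one Gitaly node.
    # No Redis round-trips - just a locked increment/decrement.
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}  # RPC name -> in-flight count

    @contextmanager
    def track(self, rpc_name):
        with self._lock:
            self._counts[rpc_name] = self._counts.get(rpc_name, 0) + 1
        try:
            yield
        finally:
            with self._lock:
                self._counts[rpc_name] -= 1

    def ongoing_pushes(self):
        # What a CountOngoingGitPushes RPC would return for this node.
        with self._lock:
            return sum(self._counts.get(n, 0) for n in ("ReceivePack", "InfoRefs"))

counter = InFlightCounter()
with counter.track("ReceivePack"):
    print(counter.ongoing_pushes())  # nonzero while the push RPC is in flight
print(counter.ongoing_pushes())      # zero once it completes
```

For the distributed case, a coordinator would sum `ongoing_pushes()` across the nodes that can serve the repository.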
We don't necessarily have to do any additional accounting for this, but if we do, the load is spread across all the Gitaly nodes rather than concentrated on the Redis nodes.
It also adds the ability to force-terminate ongoing `git push` sessions if we're in a rush, as might be the case if, e.g., we're evacuating a storage node because of a degraded RAID array or something similar.
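The force-termination RPC could be modelled as a registry of cancel callbacks, one per in-flight push. In real Gitaly this would more likely be gRPC context cancellation; everything named here is hypothetical.

```python
class PushRegistry:
    # Each in-flight push registers a cancel callback on arrival and
    # deregisters it on completion; CancelOngoingGitPushes fires them all.
    def __init__(self):
        self._cancels = {}
        self._next_id = 0

    def register(self, cancel):
        self._next_id += 1
        self._cancels[self._next_id] = cancel
        return self._next_id

    def deregister(self, push_id):
        self._cancels.pop(push_id, None)

    def cancel_all(self):
        # The CancelOngoingGitPushes RPC body: terminate every tracked push.
        for cancel in list(self._cancels.values()):
            cancel()
        self._cancels.clear()

registry = PushRegistry()
cancelled = []
registry.register(lambda: cancelled.append("push-1"))
registry.register(lambda: cancelled.append("push-2"))
registry.cancel_all()
print(cancelled)  # both pushes were terminated
```

Because termination aborts a client mid-push, it should stay opt-in ("if in a hurry"), not part of the default migration path.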