Concurrent RPCs that delete references can deadlock in Gitaly Cluster
Problem
RPCs that delete references need to acquire the packed-refs.lock so that no concurrent process will try to update the file while processing the deletion. Unfortunately, this can cause deadlocks when at least two requests are in flight that would delete a reference in Gitaly Cluster:
- Gitaly A receives request 1 first and will acquire the lock. It will then wait for Gitaly B to cast its transactional vote.
- Gitaly B receives request 2 first and will acquire the lock. It will then wait for Gitaly A to cast its transactional vote.
So both Gitaly nodes are waiting for each other and are thus deadlocked. We frequently observe this issue in highly active repositories.
This is problematic in very active repositories because, once the deadlock occurs, replication jobs are queued to bring the nodes of the cluster back in sync. For a highly active repository, these replication jobs don't complete before the next change comes in. The result is a busy-spinning situation that lasts until activity on the repository drops far enough for all nodes to complete their replication queues and be synchronized once again.
Challenges
The team is actively investigating solutions to this. Given the current architecture of Git, it's unlikely we can eliminate the problem completely, for two reasons:
- There are multiple RPCs that can trigger the need to acquire the packed-refs.lock. Unless we strongly enforced a concurrency limit across all such RPCs (creating a large performance bottleneck), contention will occasionally happen.
- What we're seeing is a technical limitation of Git itself. There have been efforts to fix this in upstream Git with a new reference backend, but an agreed-upon solution has yet to be found.
Solution
While it is unlikely that we can completely resolve this, we can mitigate it substantially. An evaluation of a very large and active repository found that a delete-refs storm can occur. This is caused by the internal refs that are created as part of merge requests, merge trains and pipelines (described in Workaround Gitaly deadlock when deleting refs b... (#5369 - closed)).
One option on the Gitaly side is to enforce a concurrency limit of one for DeleteRefs. While this is not perfect, and other mutating RPCs can still cause the nodes to come out of sync, it should eliminate the primary cause of deadlocks.
Therefore, this is the currently proposed MVC.