gitaly/hook: Fix packed-refs lock contention by synchronizing hooks

In highly active repositories we frequently see lock contention around the packed-refs file, which needs to be locked whenever a reference is deleted. This typically manifests in the DeleteRefs RPC with the following error message:

fatal: prepare: Unable to create '$REPO_PATH/packed-refs.lock': File exists.

We tried to work around this contention by increasing the timeout for acquiring the lock to 10 seconds in b0a54103 (git: Extend locking timeout for packed-refs to decrease contention, 2023-05-24). Unfortunately this change did not help; it mostly caused the blocked RPCs to take longer to fail. It's thus clear that we need to take a step back and check why this lock might be held for 10 seconds or longer, as this indicates an architectural issue.

And indeed, the root cause of this is our custom hooks. In Gitaly Cluster, we make sure that the hook logic is only executed on the primary node so that we don't needlessly duplicate any of the work performed by the hook. But because secondaries skip executing the hook logic, the consequence is that they will forge ahead executing the rest of the RPC logic while the primary is still handling the hook.

Executing hooks may take quite a long time though, especially in large repositories, primarily because we also need to invoke Rails' /internal/allowed endpoint in order to authenticate the changes. On large repositories, for example our own gitlab-org/gitlab, we frequently see hook execution take dozens of seconds.

Ultimately, the consequence is that the secondaries will wait for the primary node to catch up on their next transactional vote. But in all of our RPCs the next transactional vote happens after we have already locked references, so the secondaries are essentially blocking all other RPCs from accessing these references while they themselves are blocked waiting for the primary. Given that the primary may take dozens of seconds to catch up, this neatly explains the observed lock contention.

Fix this bug by synchronizing hook execution across primary and secondary nodes. This will cause the secondaries to wait until the primary has finished executing the hooks before they acquire any locks, and thus greatly reduce the amount of time spent in the critical section on secondaries.

Closes Investigate whether we can avoid locking `packe... (#5353 - closed).

Edited by Pavlo Strokov