Workaround Gitaly deadlock when deleting refs by GitLab
Problem
GitLab maintains a set of internal references related to merge requests, merge trains and pipelines.
Those refs provide stable anchor points for given commits. Since those refs are managed by GitLab, they are also recycled when certain events occur:
- pipeline finishes: pipeline ref is deleted ("refs/#{Repository::REF_PIPELINES}/#{pipeline.id}")
- merge request is closed: merge request ref is deleted
- merge train car is processed: merge train ref is deleted
For high-frequency repositories, a lot of concurrent deletes happen, for example when many pipelines run concurrently. This triggers the problem described in #5368 (closed).
When this happens, Gitaly effectively loses replication state and falls back to the slow ReplicateRepository to recover it. Since the repository changes at high frequency, this results in a busy loop: we cannot replicate quickly enough, so another ReplicateRepository is scheduled, and as a result CI prefers to use the primary basically all the time.
In the case of one customer, this is the primary problem, related to the high frequency of CI pipeline creation and the need to clean up Ci::PersistentRef after pipelines are done.
Solutions
This describes possible solutions to the stated problem, where 1. is the most favorable and 2. is the workaround.
1. Fix Gitaly
Preferably, solve the underlying problem via #5368 (closed). However, this is not a trivial task and will take some effort.
2. Introduce repository_refs_to_delete table
- In all cases where we call repository.delete_refs (gRPC method for DeleteRefs), we would enqueue the ref to delete in repository_refs_to_delete.
- We would schedule a Sidekiq job in the future (15 minutes from now).
- The Sidekiq job would be scoped to the project and delete up to 100 refs at a time.
- The Sidekiq job would be de-duplicated and guaranteed to run only once at a given time, to avoid the Gitaly concurrency bug.
Likely we need ci_repository_refs_to_delete (in the CI database) and repository_refs_to_delete (in the main database).
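The queue-and-batch idea in 2. can be sketched in plain Ruby. This is only an illustration under stated assumptions: the RefDeletionQueue class, its method names, the in-memory array standing in for the repository_refs_to_delete table, and the example ref names are all hypothetical. The block passed to process stands in for a single DeleteRefs call, and the scheduled_projects hash models Sidekiq job de-duplication per project.

```ruby
# Hypothetical in-memory stand-in for the repository_refs_to_delete table.
class RefDeletionQueue
  BATCH_SIZE = 100         # "up to 100 refs at a time" from the proposal
  DELAY_SECONDS = 15 * 60  # "15 minutes from now" from the proposal

  Entry = Struct.new(:project_id, :ref, :delete_after)

  # project_id => time of the single pending job (models de-duplication)
  attr_reader :scheduled_projects

  def initialize
    @entries = []
    @scheduled_projects = {}
  end

  # Called where repository.delete_refs would otherwise be invoked directly.
  def enqueue(project_id, ref, now: Time.now)
    @entries << Entry.new(project_id, ref, now + DELAY_SECONDS)
    # Keep at most one pending job per project.
    @scheduled_projects[project_id] ||= now + DELAY_SECONDS
  end

  # Body of the hypothetical per-project job. Yields one batch of due refs
  # to the block and returns how many refs were handed over.
  def process(project_id, now: Time.now)
    @scheduled_projects.delete(project_id)

    due, rest = @entries.partition do |entry|
      entry.project_id == project_id && entry.delete_after <= now
    end
    batch = due.first(BATCH_SIZE)
    @entries = rest + due.drop(BATCH_SIZE)

    yield batch.map(&:ref) unless batch.empty?

    # Re-schedule one follow-up job if refs for this project remain.
    next_due = @entries.select { |e| e.project_id == project_id }
                       .map(&:delete_after).min
    @scheduled_projects[project_id] = [next_due, now].max if next_due

    batch.size
  end
end
```

With 150 refs queued for one project, the first due run hands 100 refs to the block in one call and re-schedules itself, and the second run handles the remaining 50. Because each project has at most one pending job, deletions for a repository are never issued concurrently by this path.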
3. Introduce ::Ci::PipelineCleanupRefWorker worker
- When trying to delete a ref, we would schedule ::Ci::PipelineCleanupRefWorker instead.
- This would ensure that only a single deletion is happening at a given time by using ExclusiveLock.
- It would be retried for a limited amount of time.
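The locking behaviour in 3. can be sketched as follows. This is a minimal illustration, not the actual worker: SimpleExclusiveLock is a hypothetical in-process stand-in for the ExclusiveLock mentioned above, and CleanupRefSketch, MAX_ATTEMPTS, the lock key format and the example ref names are assumptions. The block passed to perform stands in for the real ref deletion.

```ruby
# Hypothetical stand-in for the ExclusiveLock mentioned above:
# a per-key, non-blocking lock held in process memory.
class SimpleExclusiveLock
  def initialize
    @held = {}
    @guard = Mutex.new
  end

  # Returns true if the lock was obtained, false if it is already held.
  def try_obtain(key)
    @guard.synchronize do
      next false if @held[key]

      @held[key] = true
    end
  end

  def release(key)
    @guard.synchronize { @held.delete(key) }
  end
end

# Sketch of the worker logic: one deletion per project at a time,
# retried a limited number of times.
class CleanupRefSketch
  MAX_ATTEMPTS = 3 # assumed limit; the proposal only says "limited"

  def initialize(lock)
    @lock = lock
  end

  # Returns :deleted if the block ran under the lock, :gave_up otherwise.
  def perform(project_id, ref, sleep_seconds: 0.01)
    key = "cleanup_ref:#{project_id}"

    MAX_ATTEMPTS.times do
      if @lock.try_obtain(key)
        begin
          yield ref
          return :deleted
        ensure
          @lock.release(key)
        end
      end
      sleep(sleep_seconds) # a real worker would re-enqueue instead of sleeping
    end

    :gave_up
  end
end
```

A call such as CleanupRefSketch.new(lock).perform(1, "refs/pipelines/42") { |ref| ... } runs the block only while holding the per-project lock, so two deletions for the same project never overlap. A real implementation would need a lock shared across processes rather than an in-memory one.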
Remark
- The approach in 2. should affect only internally managed refs (merge trains, merge requests, CI pipelines), as we are OK with delaying their removal.
- User-made changes (removing a tag or branch) should remain immediate (as they are today), even if they might trigger the problem.
- Doing 2. can help overall anyway by reducing the load GitLab Rails puts on Gitaly, even once the Gitaly problem is mitigated.