Praefect replication errors for repositories that no longer exist
Overview
A corrective action has been identified: the Gitaly team is moving toward retiring the replication queue.
We have also identified that this issue has no risk of cascading failures.
We are leaving this issue open until we validate that the replacement for the replication queue solves this problem. We are marking it as priority4 / severity4 because the resolution is not being tracked under this issue, and we want to avoid raising unnecessary flags while the fix is implemented.
Summary
This is a corrective action for gitlab-com/gl-infra/production#6133 (closed), in which a single Praefect node was offline for an extended period of time. When this was resolved and replication resumed, we saw a large number of errors for repositories that had been deleted:
- choose replica path: get replica path: repository not found
: rpc error: code = NotFound desc = GetRepoPath: not a git repository: "/var/opt/gitlab/git-data/repositories/@hashed/b5/b0/b5b011f2c5f1914ff16fa2d47422379bd97eb3a333dab4fc621281f3ea2259a4.wiki.git"
These should probably not be reported as errors if the deletions are expected. @pks-t commented on Slack:
I'm not sure whether we delete replication jobs properly in case a repository is deleted, but if not then it would also explain why we see so many failures.
In theory we would also only retry those replication jobs a finite amount of times and eventually purge them from the queue.
Recommendation
We should better understand whether these errors are expected. If they are, we can probably handle them more gracefully, for example by not reporting them as errors.
Verification
I think we can verify this by catching up a Praefect replica after deleting projects.