Geo: Does not mark repositories as missing on primary due to stale cache
Summary
GitLab - and especially Geo - keys a number of important decisions off the question of whether a repository exists on disk or not. This is currently answered using a Gitaly RPC. The result of that RPC is cached in Rails, using the Redis cache. That cache is sometimes inaccurate.
This was noticed during the GCP Migration, where - without repository verification - it would have resulted in data loss. This is because when there's no repository, the primary returns 404 errors to the secondary. This is taken by the secondary as a signal not to bother trying to sync the data any more (or at least, until another Geo::RepositoryUpdatedEvent is received, e.g. through a git push to the primary).
When the cache is wrong, git clients wrongly receive a 404 error. This is OK for humans, who will probably just retry. For Geo, it's a data-loss scenario.
Steps to reproduce
?????
Example Project
See https://gitlab.com/gitlab-com/migration/issues/546#note_93849179
What is the current bug behavior?
project.repository.exists? returns false when project.repository.raw_repository.exists? returns true
What is the expected correct behavior?
We should always be able to rely on this being true: project.repository.exists? == project.repository.raw_repository.exists?
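To make the mismatch concrete, here is a minimal, self-contained sketch of how a cached existence check can disagree with the authoritative one. Everything in it is a hypothetical stand-in: FakeRawRepository, FakeRepository, stale_exists_cache? and the Hash used as a cache are illustrative only and are not GitLab's actual classes. The sequence that produces the stale entry (a repository appearing after "false" was cached, with nothing expiring the entry) is an assumption about one way this could happen, not a confirmed root cause.

```ruby
# Hypothetical stand-ins; none of these classes are GitLab's real implementation.

# Plays the role of the authoritative on-disk check behind the Gitaly RPC.
class FakeRawRepository
  attr_accessor :on_disk

  def initialize(on_disk)
    @on_disk = on_disk
  end

  def exists?
    @on_disk
  end
end

# Plays the role of the Rails-side wrapper that caches the RPC result.
class FakeRepository
  attr_reader :raw_repository

  def initialize(raw_repository, cache)
    @raw_repository = raw_repository
    @cache = cache # a plain Hash standing in for the Redis cache
  end

  # Returns the cached answer if one is present; otherwise asks the
  # raw repository and stores the result.
  def exists?
    return @cache[:exists] if @cache.key?(:exists)

    @cache[:exists] = raw_repository.exists?
  end

  # True when the cached answer disagrees with the authoritative one.
  def stale_exists_cache?
    exists? != raw_repository.exists?
  end
end

cache = {}
raw = FakeRawRepository.new(false)
repo = FakeRepository.new(raw, cache)

repo.exists?        # caches false while the repository is missing
raw.on_disk = true  # assumed scenario: the repository appears, the cache is not expired

puts repo.exists?                 # cached answer
puts repo.raw_repository.exists?  # authoritative answer
puts repo.stale_exists_cache?     # whether the two disagree
```

In this sketch the cached answer stays false after the repository appears, which is the same shape as the behaviour reported above: a caller that only consults the cached value would treat an existing repository as missing.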
Output of checks
This bug happens on GitLab.com
Possible fixes
Disable the cache
It's probably not practical: we use this datum in all sorts of places, and an RPC is quite expensive compared to a stat call.
I don't have any good suggestions at present, but I wanted to capture that this happens, and that it can lead to data loss in certain ~Geo cases.