Geo: Does not mark repositories as missing on primary due to stale cache
GitLab - and especially Geo - key a number of important decisions off the question of whether a repository exists on disk or not. This is currently done using a Gitaly RPC. The result of that RPC is cached in Rails, using the Redis cache. That cache is sometimes inaccurate.
This was noticed during the GCP Migration, where - without repository verification - it would have resulted in data loss. When the primary believes there is no repository, it returns 404 errors to the secondary. The secondary takes this as a signal to stop trying to sync the data (at least until another Geo::RepositoryUpdatedEvent is received, e.g. through a git push to the primary).
When the cache is wrong, git clients wrongly get a 404 error. This is OK for humans, who will probably just retry. For Geo, it's a data-loss scenario.
Steps to reproduce
What is the current bug behavior?
project.repository.exists? returns false when project.repository.raw_repository.exists? returns true
What is the expected correct behavior?
We should always be able to rely on this being true:
project.repository.exists? == project.repository.raw_repository.exists?
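To make the failure mode concrete, here is a minimal toy model of a cached existence check going stale. It is a sketch only: ToyRawRepository, ToyRepository, expire_exists_cache and exists_cache_consistent? are hypothetical names invented for illustration, a plain Hash stands in for the Redis cache, and none of this is GitLab's actual implementation.

```ruby
# Toy model only: these classes are hypothetical stand-ins, not GitLab code.

# Stands in for the Gitaly-backed check of whether the repository is on disk.
class ToyRawRepository
  attr_accessor :on_disk

  def initialize(on_disk:)
    @on_disk = on_disk
  end

  def exists?
    @on_disk
  end
end

# Stands in for the Rails-side wrapper that caches the answer.
# A plain Hash plays the role of the Redis cache.
class ToyRepository
  attr_reader :raw_repository

  def initialize(raw_repository, cache)
    @raw_repository = raw_repository
    @cache = cache
  end

  # Returns the cached answer if one is present; otherwise asks the raw
  # repository and stores the result.
  def exists?
    return @cache[:exists] if @cache.key?(:exists)

    @cache[:exists] = raw_repository.exists?
  end

  # Drops the cached answer so the next call goes back to the source of truth.
  def expire_exists_cache
    @cache.delete(:exists)
  end

  # The invariant from this issue: cached and uncached answers must agree.
  def exists_cache_consistent?
    exists? == raw_repository.exists?
  end
end

cache = {}
raw = ToyRawRepository.new(on_disk: false)
repo = ToyRepository.new(raw, cache)

repo.exists?        # caches the "missing" answer
raw.on_disk = true  # repository appears on disk, but nothing expires the cache

stale_consistent = repo.exists_cache_consistent?
repo.expire_exists_cache
fresh_consistent = repo.exists_cache_consistent?
```

In this toy model the invariant only holds again once the cached value has been expired. Any path that changes the on-disk state without expiring the cache reproduces the mismatch described above.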
Output of checks
This bug happens on GitLab.com
Disable the cache
It's probably not practical: we use this datum in all sorts of places, and an RPC is quite expensive compared to a
I don't have any good suggestions at present, but I wanted to capture that this happens, and that it can lead to data loss in certain ~Geo cases.