Improve repo renames on Geo secondary
(This is a spinoff from #323238 (comment 524532325) and its surrounding discussion.)
Problem
Because of how Gitaly Cluster (i.e. Praefect) stores repositories in its database, renames are tricky and might fail. On heavily loaded Praefect clusters, this can lead to inconsistencies between what is on disk and what is in the Praefect database, and recovering from them requires manual steps.
Geo renames/moves repositories into two directories: @failed-geo-sync and @geo-temporary. Failures in moving these repositories can lead to sync failures, which require modifying the filesystem or database to clear.
One example of steps to reproduce
I'm running a 3k + 3k Geo reference architecture for testing, and I've created a bunch of gitlab-shell projects (no forks) with the following script: https://gitlab.com/-/snippets/2149641
The Geo secondary is currently in this state:
On the Geo Replication status page I'm seeing these types of errors:
Synchronization failed - Error syncing repository: Temporary repository can not be removed
Synchronization failed - Error syncing repository: Can not move temporary repository to canonical location
Synchronization failed - Error syncing repository: 2:mutator call: route repository mutator: get primary: repository "default"/"@geo-temporary/@hashed/fd/87/fd8751df0d48ed07232aa08e9845c135af55152c8e87d804faf1b8f13716156b.git" not found.
Verification failed - Repository checksum mismatch
Those verification failed errors only have 1 retry, but the other errors all have 400-500 retries, and some even up to 1500 retries (it has been running for a few weeks now).
I'm not sure yet what the exact cause is, but it seems this method of using @geo-temporary repositories and then moving them isn't working well with Gitaly Cluster.
Proposal
- The Gitaly team is working on gitaly#3832 (closed) and gitaly#3485 (closed). Once either of those is implemented, verify that it prevents the consistency issue, or make the necessary changes to Geo.
- Implement a command to clean up currently impacted systems using the praefect list-untracked-repositories and remove-repository commands.
- We should consider documenting some sizing recommendations for Praefect. Using a stock 3k reference architecture, we started seeing the issues on a Geo secondary after we created around 40,000 repositories. Even doubling the CPU/Memory on the database node didn't speed the database up enough to avoid the timeouts.
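
As a starting point for the cleanup command proposed above, here is a rough Python sketch of a dry run. It is not an existing tool. It assumes that praefect list-untracked-repositories prints one JSON object per line with relative_path and virtual_storage keys, that the Praefect config lives at the Omnibus default path, and that remove-repository takes -virtual-storage plus a path flag whose name depends on the Gitaly version (exposed here as the hypothetical path_flag parameter). All of these should be checked against the installed version. The sketch only builds the command lines and runs nothing.

```python
import json
from typing import Dict, Iterable, List

# Directories Geo moves repositories into, as described in this issue.
GEO_PREFIXES = ("@geo-temporary/", "@failed-geo-sync/")


def parse_untracked(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse list-untracked-repositories output.

    Assumption: one JSON object per line with "relative_path" and
    "virtual_storage" keys. Anything else (blank lines, log noise) is skipped.
    """
    repos = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if (
            isinstance(entry, dict)
            and "relative_path" in entry
            and "virtual_storage" in entry
        ):
            repos.append(entry)
    return repos


def geo_leftovers(repos: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only repositories that live under one of the Geo directories."""
    return [r for r in repos if r["relative_path"].startswith(GEO_PREFIXES)]


def removal_commands(
    repos: List[Dict[str, str]],
    praefect_bin: str = "praefect",
    config: str = "/var/opt/gitlab/praefect/config.toml",  # assumed Omnibus default
    path_flag: str = "-repository",  # assumption: flag name varies by version
) -> List[List[str]]:
    """Build (but do not run) one remove-repository command per repository."""
    return [
        [
            praefect_bin, "-config", config, "remove-repository",
            "-virtual-storage", r["virtual_storage"],
            path_flag, r["relative_path"],
        ]
        for r in repos
    ]


if __name__ == "__main__":
    # Hypothetical sample output; real output would come from the praefect CLI.
    sample = [
        '{"relative_path":"@geo-temporary/@hashed/aa/bb/example.git",'
        '"storage":"gitaly-1","virtual_storage":"default"}',
        '{"relative_path":"@hashed/cc/dd/other.git",'
        '"storage":"gitaly-1","virtual_storage":"default"}',
    ]
    for cmd in removal_commands(geo_leftovers(parse_untracked(sample))):
        print(" ".join(cmd))
```

Printing the commands instead of executing them lets an administrator review exactly which @geo-temporary and @failed-geo-sync entries would be touched before anything is removed.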