Praefect cannot sync repositories properly and, under normal use, ends up with corrupted database records.
## Summary
Praefect lacks the capability to synchronize with the Gitaly filesystem. If a change occurs on the Gitaly filesystem, Praefect cannot detect the change and update its tracking database. We've encountered this issue in other PS engagements, where Praefect and Gitaly repeatedly fall out of sync, and each time that happens the customer is forced to wipe their cluster.
In Highly-Available clusters, this can also occur when cluster latency causes a file transfer to fail or a message to be lost in transit.
Another example is restoring Gitaly backups from snapshots in an HA GitLab cluster with large data sets. The Praefect and Gitaly snapshots may not be 100% in sync. Once the snapshot is restored, the only way to restore the Praefect tracking data is to wipe Praefect and resync, which undermines the point of having a snapshot.
Once a repository is moved or modified without Praefect being aware of it, there is no way to rectify the resulting record corruption besides manually modifying the Praefect database, or wiping the Praefect database and rebuilding it. Neither option is viable for many of our customers. A solution would be to add four commands to manage Praefect.
## Path to Resolution
Below is a series of commands desired to rectify these issues, listed in order of importance. The first is a command to resync the Praefect database with the filesystem.
### Must-haves
- [x] `gitlab-ctl praefect resync-db` (&6775)
Whose purpose and goal is to start a background operation that checks the filesystem against the database and reconciles any differences.
- [x] `gitlab-ctl praefect delete-repo-db <repo>` (https://gitlab.com/gitlab-org/gitaly/-/issues/3769, Omnibus gitlab-org/gitaly#3784)
Whose purpose and goal is to remove a repository from the Praefect tracking database. This can be used in the event a repository has been manually removed, or must be manually removed.
- [x] `gitlab-ctl praefect delete-repo <repo>` (https://gitlab.com/gitlab-org/gitaly/-/issues/3771, Omnibus gitlab-org/gitaly#3784)
Whose purpose and goal is to remove a repository from the filesystem of the Gitaly nodes in addition to the Praefect tracking database. This can be used in events such as a failed Geo sync, where a repo needs to be entirely removed so it can be redownloaded and resynced.
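To illustrate the core of what a `resync-db` command would need to do, the sketch below diffs the set of repositories Praefect tracks against the set actually present on disk. This is a simplified, hypothetical model, not Praefect's actual implementation; the repository paths and in-memory sets are made up for illustration.

```python
# Sketch of the reconciliation a resync implies: compare the tracking
# database's view of repositories against what exists on a Gitaly node's
# filesystem, and classify the mismatches. All data here is hypothetical.

def reconcile(tracked: set, on_disk: set) -> dict:
    """Classify mismatches between the tracking DB and the filesystem."""
    return {
        # Tracked in the DB but missing on disk: stale records to delete.
        "stale_records": tracked - on_disk,
        # Present on disk but untracked: orphans to adopt or remove.
        "untracked_repos": on_disk - tracked,
    }

tracked = {"@hashed/aa/bb/repo1.git", "@hashed/cc/dd/repo2.git"}
on_disk = {"@hashed/aa/bb/repo1.git", "@hashed/ee/ff/repo3.git"}

result = reconcile(tracked, on_disk)
```

A real implementation would run this per virtual storage and per Gitaly node, then either repair records (`track-repo`) or drop them (`delete-repo-db`).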
### Possible additions
The following commands may or may not be needed. Given the Gitaly team's resource constraints, we want to ensure there is an absolute need, validated with customers, before implementing them.
- [x] `gitlab-ctl praefect track-repo <repo> <hashed_dir>` (https://gitlab.com/gitlab-org/gitaly/-/issues/3773)
Whose purpose and goal is to add a repository to the Praefect tracking database. This allows GitLab administrators to track a repository when it is restored from a snapshot or backup.
- [ ] `gitlab-ctl praefect offline-node <nodeid>` and `gitlab-ctl praefect online-node <nodeid>` (https://gitlab.com/gitlab-org/gitaly/-/issues/3774)
Whose purpose and goal is to ensure this important maintenance process is exposed as an API or command. Such functionality implies the cluster could know how far out of sync a node is when it is brought back online, and perform an optimized resync rather than a full one. Even if GitLab upgrades do not require this, vertically scaling hardware (including virtual) and adding disk space (depending on the virtualization platform, OS, and filesystem) will require it.
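To illustrate why re-onlining could be optimized rather than forcing a full resync: if the cluster records a replication generation per repository replica, a returning node only needs the repositories whose local generation lags the cluster's latest. A hypothetical sketch (the generation numbers and repository names are invented for illustration):

```python
# Hypothetical sketch of an optimized resync after `online-node`:
# only repositories whose generation on the returning node is behind
# the cluster's latest generation (or missing entirely) need replication.

def repos_needing_resync(latest_gen: dict, node_gen: dict) -> list:
    """Return repos on the returning node that lag the cluster's state."""
    return sorted(
        repo for repo, gen in latest_gen.items()
        if node_gen.get(repo, -1) < gen
    )

latest = {"repo-a": 5, "repo-b": 3, "repo-c": 7}   # cluster's view
node = {"repo-a": 5, "repo-b": 1}                  # repo-c missing entirely

lagging = repos_needing_resync(latest, node)
```

Here `repo-a` is current and is skipped; only `repo-b` (behind) and `repo-c` (absent) would be replicated, which is far cheaper than a full resync for a node that was offline briefly.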
### Automation
After interviewing multiple customers, we do not believe that there is a strong need to provide automated execution of these commands for reconciliation. Instead, we are choosing to focus on [providing automatic health checks](https://gitlab.com/groups/gitlab-org/-/epics/6804) for issues that have been seen numerous times in the field. Feel free to contribute to that epic if you believe there is a strong need for other automated checks.
### How to identify if I have been affected?
- Scenario 1: When you navigate to a repository in GitLab's UI, the repository files do not display, or a 500 error occurs indicating the repository's files can't be found, yet the files are present on the Gitaly filesystem.
- Scenario 2: One or more of your repositories suddenly appear in a read-only state, with no visible reason or means to make them writable.
- Scenario 3: Repository creation fails, with an error that the repository already exists on disk.
### Scenarios in which this occurs
- Restore of Gitaly/Praefect from snapshot backup (100% of the time)
- GitLab Upgrade to 13.12 (If you have unclean repos)
- Gitaly Cluster in Autoscale Group
- Gitaly Node offline during a write
- Initial GitLab Geo sync
- {- Any scenario in which a Gitaly filesystem changes, without Praefect being aware -}
### Steps to reproduce
This issue can be replicated by making a change to the Gitaly filesystem that Praefect tracks. For example, in a 3-node Gitaly cluster, manually move a repository between the Gitaly nodes, then try to access it. GitLab will error, stating the repo doesn't exist or can't be found, despite the fact that one or two of the Gitaly nodes have the repository on their filesystem.
The only solution is to manually modify the database, or to wipe the database entirely and resync the Gitaly filesystem from an external source.
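The failure mode can be modeled in miniature: treat the tracking database as a mapping from repository to the node Praefect believes holds it, then move the repository on disk without telling "Praefect". All node and repository names below are illustrative.

```python
# Miniature model of the bug: Praefect's record says which node holds a
# repo, but a manual move on disk is invisible to it. Names are illustrative.

praefect_db = {"group/project.git": "gitaly-1"}   # tracked location
node_disks = {
    "gitaly-1": {"group/project.git"},
    "gitaly-2": set(),
    "gitaly-3": set(),
}

def read_repo(repo: str) -> str:
    """Route a read the way Praefect would: via its tracking record."""
    node = praefect_db[repo]
    if repo not in node_disks[node]:
        raise FileNotFoundError(f"{repo} not found on {node}")
    return node

# Manually move the repo between nodes, bypassing Praefect:
node_disks["gitaly-1"].remove("group/project.git")
node_disks["gitaly-2"].add("group/project.git")

# Praefect still routes reads to gitaly-1, so access now fails even
# though the repository exists on gitaly-2.
```

This is why the error says the repository doesn't exist even though it is clearly present on another node's filesystem: the routing record, not the data, is what's broken.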
### What is the current *bug* behavior?
When the Gitaly filesystem changes, Praefect has no awareness of it and breaks in unexpected ways. {- With no supported way to resolve this issue inside Praefect. -}
## Customers Affected
{- Every customer using Gitaly Cluster in GitLab 13.12+ is at risk of running into this scenario. -}
Customers using Gitaly Cluster in GitLab 13.11 and below are at low risk, because there is no per-repository Gitaly primary; there is a single Gitaly primary. Some of the scenarios listed above may still occur in that configuration, but they are unrelated to this issue and have a different cause.
## Customers Directly Affected
- https://gitlab.my.salesforce.com/00161000004zrCF
- https://gitlab.my.salesforce.com/0064M00000WtEhO
- https://gitlab.my.salesforce.com/0064M00000XbEpX
- https://gitlab.my.salesforce.com/0064M00000XbEpr
- https://gitlab.my.salesforce.com/0016100001CXGCs
- https://gitlab.my.salesforce.com/00161000015MAsE
### Relevant logs and/or screenshots
- https://gitlab.slack.com/archives/C3ER3TQBT/p1623697303335700
- https://gitlab.com/gitlab-org/gitlab/-/issues/336532
- https://gitlab.com/gitlab-org/gitlab/-/issues/332902
- https://gitlab.com/gitlab-org/quality/reference-architectures/-/issues/56
### Related Support Tickets
- https://gitlab.zendesk.com/agent/tickets/223793
- https://gitlab.zendesk.com/agent/tickets/221563
- https://gitlab.zendesk.com/agent/tickets/223076
- https://gitlab.zendesk.com/agent/tickets/217866
- https://gitlab.zendesk.com/agent/tickets/225838
- https://gitlab.zendesk.com/agent/tickets/224205
- https://gitlab.zendesk.com/agent/tickets/223880
- https://gitlab.zendesk.com/agent/tickets/218829
- https://gitlab.zendesk.com/agent/tickets/216587
- https://gitlab.zendesk.com/agent/tickets/217962
- https://gitlab.zendesk.com/agent/tickets/200962