No reconciliation when a Gitaly node is entirely lost
If a Gitaly node in a Gitaly Cluster dies and loses all its data, Praefect's reconciliation never schedules replication jobs to backfill the node after it is rebuilt.
The issue seems to be that Praefect assumes that if a storage_repositories entry exists for a repository on that Gitaly node, the replica must be fine; it never verifies that the files are actually on disk.
I'm able to work around it by deleting all the storage_repositories entries for that Gitaly node; Praefect then recognizes the need to schedule replication jobs. Ideally, there would be either a Praefect command to reset the state of a Gitaly node (i.e. delete all the storage_repositories entries for it), or reconciliation would make a Gitaly call to the node to verify that each repo exists and schedule a replication job if it does not.
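To make the second option concrete, here is a rough sketch of the check I have in mind. It is not Praefect code: `ReplicaRecord`, `replicas_to_backfill`, and the `repo_exists` callback are hypothetical names, and the callback stands in for whatever Gitaly call would confirm that a repository is present on a node.

```python
from typing import Callable, Iterable, List, NamedTuple


class ReplicaRecord(NamedTuple):
    """Hypothetical stand-in for one storage_repositories row."""
    storage: str        # name of the Gitaly node
    relative_path: str  # repository path on that node


def replicas_to_backfill(
    records: Iterable[ReplicaRecord],
    repo_exists: Callable[[str, str], bool],
) -> List[ReplicaRecord]:
    """Return records whose repository is recorded but missing on the node.

    repo_exists(storage, relative_path) is an assumed callback that asks
    the node whether the repository is actually on disk.
    """
    return [r for r in records if not repo_exists(r.storage, r.relative_path)]


if __name__ == "__main__":
    # Simulated on-disk state: gitaly-2 was rebuilt and is empty.
    on_disk = {("gitaly-1", "a.git"), ("gitaly-1", "b.git")}
    records = [
        ReplicaRecord("gitaly-1", "a.git"),
        ReplicaRecord("gitaly-2", "a.git"),
        ReplicaRecord("gitaly-1", "b.git"),
        ReplicaRecord("gitaly-2", "b.git"),
    ]
    for r in replicas_to_backfill(records, lambda s, p: (s, p) in on_disk):
        print(f"schedule replication: {r.relative_path} -> {r.storage}")
```

With a check along these lines, a rebuilt node would get replication jobs for every repository the database still lists for it, without anyone having to delete the storage_repositories entries by hand.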
To reproduce:
- Set up Praefect with 2 Gitaly nodes
- Ensure all repos are fully replicated on both nodes
- Simulate failure on one node:
  - gitlab-ctl stop gitaly
  - mv /var/opt/gitlab/git-data/repositories /var/opt/gitlab/git-data/repositories.old
  - mkdir /var/opt/gitlab/git-data/repositories
  - chown git:git /var/opt/gitlab/git-data/repositories
  - gitlab-ctl start gitaly
- Praefect will fail to repopulate this node