Teach Praefect dataloss command to list which Gitaly nodes are missing data

Problem to solve

The praefect dataloss subcommand reports which repository replication jobs failed during a given time window. If there are two replicas and the job fails for one of them but succeeds for the other, data loss is still reported, and the output does not say which replica is affected.

Further details

Current subcommand output:

Failed replication jobs between [2020-01-02 00:00:00 +0000 UTC, 2020-01-03 00:00:00 +0000 UTC):
test-repo/relative-path/1: 1 jobs
test-repo/relative-path/2: 4 jobs
test-repo/relative-path/3: 2 jobs

Proposal

For each repository that is suspected to have data loss on one or more nodes, list which Gitaly nodes are suspected of having missing data.

Assuming gitaly-1 was the primary and just went down:

Failed replication jobs between [2020-01-02 00:00:00 +0000 UTC, 2020-01-03 00:00:00 +0000 UTC):
test-repo/relative-path/1: gitaly-2, gitaly-3
test-repo/relative-path/2: gitaly-2
test-repo/relative-path/3: gitaly-3
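
The grouping behind the proposed output could look roughly like the Go sketch below. It is only an illustration, not Praefect's actual code: the FailedJob type, its fields, and the nodesMissingData and formatReport helpers are hypothetical names, and it assumes each failed replication job records the repository's relative path and the Gitaly node it was replicating to. Nodes are de-duplicated and sorted so that a repository with several failed jobs against the same node lists that node once.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// FailedJob is a hypothetical record of one failed replication job.
// Praefect's real job type is not shown in this issue.
type FailedJob struct {
	RelativePath string // repository whose replication failed
	TargetNode   string // Gitaly node the job was replicating to
}

// nodesMissingData groups failed jobs by repository and returns, for each
// repository, the sorted and de-duplicated list of target nodes.
func nodesMissingData(jobs []FailedJob) map[string][]string {
	seen := make(map[string]map[string]bool)
	for _, j := range jobs {
		if seen[j.RelativePath] == nil {
			seen[j.RelativePath] = make(map[string]bool)
		}
		seen[j.RelativePath][j.TargetNode] = true
	}

	out := make(map[string][]string, len(seen))
	for repo, nodes := range seen {
		list := make([]string, 0, len(nodes))
		for n := range nodes {
			list = append(list, n)
		}
		sort.Strings(list)
		out[repo] = list
	}
	return out
}

// formatReport renders one "repository: node, node" line per repository,
// ordered by repository path so the output is stable between runs.
func formatReport(byRepo map[string][]string) string {
	repos := make([]string, 0, len(byRepo))
	for r := range byRepo {
		repos = append(repos, r)
	}
	sort.Strings(repos)

	var b strings.Builder
	for _, r := range repos {
		fmt.Fprintf(&b, "%s: %s\n", r, strings.Join(byRepo[r], ", "))
	}
	return b.String()
}

func main() {
	// Illustrative input only; these jobs are made up for the example.
	jobs := []FailedJob{
		{RelativePath: "test-repo/relative-path/1", TargetNode: "gitaly-3"},
		{RelativePath: "test-repo/relative-path/1", TargetNode: "gitaly-2"},
		{RelativePath: "test-repo/relative-path/2", TargetNode: "gitaly-2"},
		{RelativePath: "test-repo/relative-path/2", TargetNode: "gitaly-2"},
	}
	fmt.Print(formatReport(nodesMissingData(jobs)))
}
```

Grouping by target node rather than counting jobs is the point of the proposal: four failed jobs against one node and one failed job against each of two nodes are different situations, and the job count alone cannot tell them apart.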
Edited by James Ramsay (ex-GitLab)