
Initial implementation of a metadata verifier

Sami Hiltunen requested to merge smh-background-verifier into master

This MR adds an initial implementation of a metadata verifier to Praefect.

Praefect stores metadata about the repositories on the cluster in Postgres. These metadata records may become out of sync with the disks if changes occur on the disks without going through Praefect, for example due to disk failures or manual modifications. Right now, Praefect only contains some temporary logic to clean up invalid metadata records when replication is attempted using a non-existent source repository. This was mostly put in place to stop reconciliation loops where Praefect keeps scheduling replication jobs from the non-existent repository that will never succeed. While this performs some cleanup, it's not sufficient to catch cases where something happens in the background without prompting replication.

The metadata verifier introduced in this commit aims to catch these issues by periodically verifying the metadata in the background against the state on the disks. For now, only the existence of the replica is verified; the actual contents are not verified by checksumming.

Each replica has a 'verified_at' timestamp in the database that tells Praefect when the metadata record was last verified. If the time since the last verification exceeds a configurable threshold, the replica is considered due for reverification. Praefect then asks the Gitaly hosting the replica whether the replica still exists. If it doesn't, the invalid metadata record is deleted and the removal is logged. To avoid multiple Praefects verifying the same replica concurrently, Praefect acquires a verification lease on the replica in the database prior to verifying the existence of the repository.

The scheduling is fairly simplistic at the moment, with each Praefect acquiring a batch of work every two seconds. This also serves as a crude way to rate limit the background verification work, to avoid consuming too many resources while doing it. This should be sufficient for now, although it could be improved later.

Praefect leaves the repository's record in place even if all of its replicas have been lost. This ensures no data loss goes unnoticed and that the loss needs to be acknowledged by removing the repository manually.

Part of #4080 (closed)

Edited by Sami Hiltunen
