Infinite attempts for delete_replica jobs

Sami Hiltunen requested to merge smh-delete-replica-infinite-attempts into master

delete_replica type jobs are emitted by the reconciler to delete repository replicas from unassigned storages. When processing such a job, Praefect first deletes the repository from the disk and then updates the database state to match. If deleting the repository from the disk succeeds but updating the database state fails, Praefect reattempts the job. If all of the attempts fail and Praefect drops the job, the replica has been deleted from the disk but the database state still implies that the storage contains a replica of the repository.

We can't allow that to happen: the reconciler still sees the invalid database state indicating that a replica exists on the storage it was deleted from, so it could schedule additional delete_replica jobs that delete the last copies of the repository. To avoid this scenario, we give delete_replica type jobs infinite attempts. This ensures we never delete a replica from the disk without having a record indicating that the deletion might have been done or is about to be done. The reconciler allows only one delete_replica job to be scheduled for a given repository, which avoids the scenario where we delete all replicas based on inconsistent database state.
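
The retry rule described above can be illustrated with a minimal Go sketch. It is not Praefect's actual code: the `Job` type, `afterFailure`, `runUntilDone` and `failNTimes` are hypothetical names made up for this example, and the database failure is simulated. The only point it shows is that a failed delete_replica job keeps its attempts and stays queued, while other job types are dropped once their attempts run out.

```go
package main

import (
	"errors"
	"fmt"
)

// ChangeType is a hypothetical stand-in for the job's change type.
type ChangeType string

const (
	UpdateRepo    ChangeType = "update"
	DeleteReplica ChangeType = "delete_replica"
)

// Job is a hypothetical, simplified replication job.
type Job struct {
	Change   ChangeType
	Attempts int // attempts remaining before the job is dropped
}

// afterFailure returns the job's state after a failed attempt and whether
// the job should stay in the queue. delete_replica jobs never lose attempts.
func afterFailure(j Job) (Job, bool) {
	if j.Change == DeleteReplica {
		return j, true
	}
	j.Attempts--
	return j, j.Attempts > 0
}

// runUntilDone retries the job until process succeeds or the job is dropped.
// It returns the number of tries made and whether the job completed.
func runUntilDone(j Job, process func() error) (tries int, completed bool) {
	for {
		tries++
		if err := process(); err == nil {
			return tries, true
		}
		var keep bool
		if j, keep = afterFailure(j); !keep {
			return tries, false
		}
	}
}

// failNTimes simulates a database update that fails n times, then succeeds.
func failNTimes(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("database unavailable")
		}
		return nil
	}
}

func main() {
	tries, ok := runUntilDone(Job{Change: UpdateRepo, Attempts: 3}, failNTimes(5))
	fmt.Println(tries, ok) // job dropped after its attempts run out

	tries, ok = runUntilDone(Job{Change: DeleteReplica, Attempts: 3}, failNTimes(5))
	fmt.Println(tries, ok) // job retried until the database update succeeds
}
```

In this sketch the delete_replica job stays queued for as long as the database update keeps failing, so the record that a deletion is in progress is never lost.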

Related to: !3162 (comment 512705472)
