Geo: Verification failure should cause resync, not retry of verification

Problem

The Geo Self-Service Framework (SSF) verifies PackageFiles, but when verification fails it currently just retries verification.

For project repos, the behavior is: if verification fails for any reason, update the registry as if the sync had failed.

  • https://gitlab.com/gitlab-org/gitlab/-/blob/v13.8.3-ee/ee/app/services/geo/repository_verification_secondary_service.rb#L51-89
  • !6759 (merged)

This automatic healing behavior is very valuable, and already proven on project repos.

Proposal

Let's use similar behavior in the SSF.

To do

  • On verification_state transition to verification_failed, transition state to failed and set retry_count to verification_retry_count, to ensure progressive backoff of syncs-due-to-verification-failures
  • Do #301247 (closed). On state transition to synced, transition verification_state to verification_pending
    • If transitioning from verification_failed, don't clear verification_retry_count, to ensure progressive backoff of syncs-due-to-verification-failures
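
The transitions above can be sketched in plain Ruby. This is only an illustration, not the SSF implementation: RegistrySketch and its method names are hypothetical, and resetting retry_count on a successful sync and verification_retry_count on a successful verification are assumptions rather than something stated in this issue.

```ruby
# Hypothetical stand-in for an SSF registry record; not the real Geo model.
class RegistrySketch
  attr_reader :state, :verification_state, :retry_count, :verification_retry_count

  def initialize
    @state = :synced
    @verification_state = :verification_pending
    @retry_count = 0
    @verification_retry_count = 0
  end

  # Verification failure is treated like a sync failure, so the file is resynced.
  def verification_failed!
    @verification_retry_count += 1
    @verification_state = :verification_failed
    @state = :failed
    # Carry the verification failure count over to the sync backoff counter.
    @retry_count = @verification_retry_count
  end

  # A successful sync always queues a fresh verification.
  def synced!
    # Keep verification_retry_count when coming from verification_failed,
    # so repeated verification failures keep backing off.
    @verification_retry_count = 0 unless @verification_state == :verification_failed
    @verification_state = :verification_pending
    @state = :synced
    @retry_count = 0 # assumption: a successful sync clears the sync counter
  end

  # Assumption: a successful verification clears the verification counter.
  def verification_succeeded!
    @verification_state = :verification_succeeded
    @verification_retry_count = 0
  end
end
```

With this sketch, two consecutive verification failures separated by a "successful" sync leave retry_count at 2 rather than 1, which is what lets the resync delay grow.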

Why worry about "syncs-due-to-verification-failures"?

For example, if the primary checksum is wrong, the secondary sees persistent sync "success" followed by verification "failure". The naive behavior here would be a verification/sync loop with no progressive backoff. This problem already exists for project repos: #208247
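
A rough illustration of the difference, assuming a hypothetical doubling backoff capped at one hour (backoff_seconds and its parameters are made up for this example and are not Geo's actual delay formula):

```ruby
# Hypothetical backoff: doubles per failure, capped at one hour.
# Not Geo's actual formula.
def backoff_seconds(retry_count, base: 60, cap: 3600)
  [base * (2**retry_count), cap].min
end

# Naive: every "successful" sync clears the counter, so each verification
# failure starts again at count 1 and the delay never grows.
naive_delays = Array.new(5) { backoff_seconds(1) }

# Proposed: the counter survives the sync, so each verification failure
# waits longer than the previous one.
preserved_delays = (1..5).map { |count| backoff_seconds(count) }
```

Under these assumptions, naive_delays stays flat while preserved_delays grows with each failure, so a persistently bad checksum stops triggering resyncs at a constant rate.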

Edited Jul 29, 2021 by Michael Kozono