Geo: SSF verification failures on secondary when the resource cannot possibly be verified
Problem 1
Verification of each resource on a secondary site requires the resource to have a checksum on the primary site already.
Currently:
- On a secondary, when querying for verification work, assume the primary checksum exists, and attempt verification.
- If the primary checksum doesn't exist, then consider it a failed verification attempt.
This is not strictly wrong, but it is somewhat wasteful, and it reports failures whenever the primary has not checksummed the resource yet. Progressive backoff limits the waste, but these transient failures would still be annoying to admins.
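The current behavior can be sketched roughly as follows. This is a simplified in-memory model, not the actual Geo code: the `Registry` struct, the `verify_on_secondary` helper, and the failure messages are hypothetical names used only for illustration.

```ruby
# Simplified in-memory stand-in for a registry row (hypothetical, not a GitLab class).
Registry = Struct.new(:id, :verification_state, :verification_failure, keyword_init: true)

# Hypothetical helper mimicking how a secondary verifies one registry today.
# primary_checksum is nil when the primary has not checksummed the resource yet.
def verify_on_secondary(registry, primary_checksum:, local_checksum:)
  if primary_checksum.nil?
    # Current behavior: a missing primary checksum counts as a failed attempt.
    registry.verification_state = :verification_failed
    registry.verification_failure = "primary checksum missing"
  elsif primary_checksum == local_checksum
    registry.verification_state = :verification_succeeded
    registry.verification_failure = nil
  else
    registry.verification_state = :verification_failed
    registry.verification_failure = "checksum mismatch"
  end
  registry
end

registry = Registry.new(id: 1, verification_state: :verification_pending)
verify_on_secondary(registry, primary_checksum: nil, local_checksum: "abc123")
puts registry.verification_state # prints verification_failed
```

The registry ends up failed even though the secondary's copy may be perfectly fine, which is the kind of transient failure an admin would see.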
Obstacle
needs_verification_count, verification_pending_batch, and verification_failed_batch in VerificationBatchWorker should not return registries where the model record doesn't already have a checksum. The problem is, we must not perform a cross-database query, since that doesn't scale well, and these methods may be called frequently. So the knowledge of "does the primary have a checksum?" must already be "copied" to the registry table by some other process.
Problem 2
We currently exclude remote-stored blobs from verification. But if you enable the beta feature "GitLab managed object storage replication" (https://docs.gitlab.com/ee/administration/geo/replication/object_storage.html#enabling-gitlab-managed-object-storage-replication), then we will sync them. The secondary will attempt to verify those registries, but the logic to checksum them does not work for remote-stored things. The primary, in turn, will never checksum them.
Obstacle
needs_verification_count, verification_pending_batch, and verification_failed_batch in VerificationBatchWorker should not return registries where the model record says it is remote-stored. The problem is, we must not perform a cross-database query, since that doesn't scale well, and these methods may be called frequently. So knowledge of "is remote stored" must already be "copied" to the registry table by some other process.
Proposal: A new verification_state: verification_disabled
- Add a verification_state called verification_disabled. It will include remote-stored things, and on secondaries, things that aren't checksummed on the primary yet.
- When a registry becomes synced, if the model record is not in verification_succeeded, then mark verification disabled
- When verifying a registry, if the model record is not in verification_succeeded, then mark verification disabled
- VerifiableReplicator.verification_total_count needs to exclude verification_disabled
No need to worry about losing checksum_succeeded events because worst-case, if it is a resource that can be verified, then the primary will reverify it later.
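A minimal sketch of how the proposal could behave, using a plain in-memory model rather than the real registry tables. `RegistryRow`, `state_after_sync`, and the free-standing method definitions are hypothetical; only the state names and the `verification_pending_batch` / `verification_total_count` method names come from the description above. It assumes the model record's verification state can be read per record at sync time.

```ruby
# Simplified in-memory stand-in for a registry row (hypothetical, not a GitLab class).
RegistryRow = Struct.new(:id, :verification_state, keyword_init: true)

# State given to a registry when it becomes synced (hypothetical helper).
# Anything the primary has not successfully checksummed is marked disabled.
def state_after_sync(model_verification_state)
  if model_verification_state == :verification_succeeded
    :verification_pending
  else
    :verification_disabled
  end
end

# Filters look only at the registry rows, so no cross-database query is needed.
def verification_pending_batch(rows, batch_size:)
  rows.select { |r| r.verification_state == :verification_pending }.first(batch_size)
end

def verification_total_count(rows)
  rows.count { |r| r.verification_state != :verification_disabled }
end

rows = [
  # Primary has a checksum: eligible for verification on the secondary.
  RegistryRow.new(id: 1, verification_state: state_after_sync(:verification_succeeded)),
  # Primary has not checksummed yet: disabled instead of failing.
  RegistryRow.new(id: 2, verification_state: state_after_sync(:verification_pending)),
  # Remote-stored on the primary: never checksummed, so disabled.
  RegistryRow.new(id: 3, verification_state: state_after_sync(:verification_disabled))
]

pending_ids = verification_pending_batch(rows, batch_size: 10).map(&:id)
total = verification_total_count(rows)
p pending_ids # prints [1]
puts total    # prints 1
```

Because the "can this be verified?" decision is stored on the registry row itself, the batch and count methods stay cheap even when called frequently.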
Follow up created for #363544