Geo: Project repo primary checksum out-of-date
Problem
On staging-ref.gitlab.com, some project repositories had out-of-date checksums on the primary site. This caused verification to fail persistently on the secondary site, even though the secondary site had matching data. The failures also trigger resyncs, which waste resources.
This state eventually self-resolves, at the latest when the re-verification interval elapses, because the primary recalculates the checksum at that time. The interval defaults to 7 days but is user-configurable; I often suggest 30-90 days to reduce resource usage. A push to the project also causes the primary to recalculate the checksum, which resolves the verification failure.
A customer experienced this as well: #426778 (comment 1588782813).
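To make the self-resolution window concrete, here is a minimal plain-Ruby sketch. It assumes the primary recalculates a checksum once the configured re-verification interval has passed since the last verification; `latest_self_resolution` is a hypothetical helper, not a GitLab method, and the date is illustrative.

```ruby
SECONDS_PER_DAY = 86_400

# Hypothetical helper: latest time by which the primary would recalculate
# the checksum, assuming re-verification runs once `interval_days` have
# passed since `last_verified_at` and no push happens earlier.
def latest_self_resolution(last_verified_at, interval_days)
  last_verified_at + (interval_days * SECONDS_PER_DAY)
end

last_verified_at = Time.utc(2024, 1, 1) # illustrative timestamp

puts latest_self_resolution(last_verified_at, 7)  # default interval
puts latest_self_resolution(last_verified_at, 90) # upper end of the suggested range
```

A longer interval reduces resource usage but also lengthens the time a stale checksum can persist without a push or a manual fix.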
Possible fixes
We do not yet know how these registry rows got into this state.
- Maybe we missed triggering checksumming somewhere that repos get updated.
- Maybe there is a race condition.
- Maybe it's related to some other recent bug.
Workaround
NOTE: The following workaround does not resolve all failures but has been effective for some customers.
If you don't have many of these sync failures
- In a Rails console in the secondary site, get a list of affected project IDs, like:
irb(main):009:0> Geo::ProjectRepositoryRegistry.where("last_sync_failure like 'Verification failed with: Checksum does not match%'").pluck(:project_id)
=> [152873, 151998, 152041]
- In a Rails console in the primary site, recalculate the checksum of each affected project (this command does it synchronously):
Project.where(id: [152873, 151998, 152041]).find_each { |p| puts p.replicator.verify }
- In a Rails console in the secondary site, set those repos to retry the failed sync now (syncs may be heavy, so this makes the work happen in the background):
Geo::ProjectRepositoryRegistry.failed.where("last_sync_failure like 'Verification failed with: Checksum does not match%'").update_all(retry_at: 1.hour.ago)
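The `like '...%'` filter used in these steps is a prefix match on the failure message. The following plain-Ruby sketch shows which rows such a filter would select. The `Row` struct and the sample rows are hypothetical stand-ins for registry records, not GitLab classes or real data.

```ruby
# Prefix used by the LIKE filter in the workaround steps.
CHECKSUM_MISMATCH_PREFIX = "Verification failed with: Checksum does not match"

# Hypothetical stand-in for a registry row.
Row = Struct.new(:project_id, :last_sync_failure)

# Return the project IDs whose failure message starts with the prefix,
# mirroring the SQL `LIKE 'prefix%'` condition. Rows without a failure
# message (nil) are skipped.
def checksum_mismatch_ids(rows)
  rows
    .select { |row| row.last_sync_failure.to_s.start_with?(CHECKSUM_MISMATCH_PREFIX) }
    .map(&:project_id)
end

sample_rows = [
  Row.new(1, "Verification failed with: Checksum does not match (illustrative detail)"),
  Row.new(2, "Some other sync error"),
  Row.new(3, nil)
]

p checksum_mismatch_ids(sample_rows) # => [1]
```

Only failures of this specific kind are retried, so unrelated sync errors keep their existing retry schedule.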
If you have a lot of sync failures
You can make the primary re-checksum all project repos by resetting their verification state:
- SSH into a Rails node in the primary Geo site.
- Start a Rails console:
gitlab-rails console
- Mark every project repository as pending verification (verification_state 0), so that Geo re-checksums each one in the background:
Geo::ProjectState.each_batch { |relation| relation.update_all(verification_state: 0) }
If the load is too high on the primary, reduce "Verification concurrency limit" in Admin > Geo > Sites > Edit (the Primary site). With the default of 100 divided across all data types, up to 5 project repo verification jobs can run concurrently.
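The arithmetic behind that figure can be sketched in plain Ruby. This assumes the limit is split evenly across data types and that there are 20 of them (inferred from 100 / 5; the actual count depends on the GitLab version). `per_type_limit` is a hypothetical helper, not a GitLab method.

```ruby
# Assumed number of verifiable data types, inferred from 100 / 5.
# This is not an official figure.
ASSUMED_DATA_TYPES = 20

# Hypothetical helper: evenly split the site-wide verification concurrency
# limit across data types, using integer division with a floor of 1 job.
def per_type_limit(total_limit, data_type_count)
  raise ArgumentError, "data_type_count must be positive" unless data_type_count.positive?

  [total_limit / data_type_count, 1].max
end

puts per_type_limit(100, ASSUMED_DATA_TYPES) # default limit
puts per_type_limit(40, ASSUMED_DATA_TYPES)  # reduced limit
```

Under these assumptions, lowering the limit from 100 to 40 would cut concurrent project repo verification jobs from 5 to 2.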