Geo: Project repo primary checksum out-of-date
Problem
On staging-ref.gitlab.com, some project repositories had out-of-date checksums on the primary site, which caused verification to fail persistently on the secondary site even though the secondary site had matching data. Each failure also triggers a resync, which is wasteful.
This state eventually self-resolves, at the latest when the re-verification interval elapses, because the primary rechecksums at that time. The interval defaults to 7 days but is user-configurable; I often suggest 30-90 days to reduce resource usage. A push to the project also causes a rechecksum on the primary, which resolves the verification failure.
A customer experienced this as well: #426778 (comment 1588782813)
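To make the worst-case wait concrete, here is a minimal sketch in plain Ruby. `latest_rechecksum_at` is a hypothetical helper, not part of GitLab, and the dates are illustrative. It assumes the primary rechecksums strictly at "last verified time + re-verification interval" and ignores queueing delay.

```ruby
# Hypothetical helper (not part of GitLab): the latest time at which the
# primary would rechecksum a repository, assuming re-verification is due
# exactly one interval after the last verification.
SECONDS_PER_DAY = 24 * 60 * 60

def latest_rechecksum_at(last_verified_at, interval_days: 7)
  last_verified_at + (interval_days * SECONDS_PER_DAY)
end

last_verified = Time.utc(2024, 1, 1) # illustrative timestamp

latest_rechecksum_at(last_verified)                    # => 2024-01-08 00:00:00 UTC
latest_rechecksum_at(last_verified, interval_days: 30) # => 2024-01-31 00:00:00 UTC
```

With a 30-90 day interval, an affected repository that receives no pushes could therefore stay in the failed state for weeks.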
Possible fixes
We do not yet know how these registry rows got into this state.
- Maybe we fail to trigger checksumming somewhere that repos get updated.
- Maybe there is a race condition.
- Maybe it's related to some other recent bug.
Workaround
Workaround: $4846478
Old workarounds
NOTE: The following workaround does not resolve all failures but has been effective for some customers.
If you don't have many of these sync failures
You can recalculate the checksum only for those affected.
- On the secondary site, run this to see what Replicators are available:

      ::Gitlab::Geo::REPLICATOR_CLASSES
- On the secondary site, copy and paste this into Rails console (replace Geo::SnippetRepositoryReplicator with your desired replicator):

      replicator_class = Geo::SnippetRepositoryReplicator

      # Returns a comma-separated string of affected model IDs, one batch at a time.
      def output_affected_model_ids(replicator_class:, limit: 100, offset: 0)
        model_ids = replicator_class
          .registry_class
          .where("last_sync_failure like 'Verification failed with: Checksum does not match%'")
          .limit(limit)
          .offset(offset)
          .pluck(replicator_class.registry_class::MODEL_FOREIGN_KEY)
          .join(',')
      end

      # Marks all affected failed registry rows as due for retry.
      def resync_affected(replicator_class:)
        replicator_class
          .registry_class
          .failed
          .where("last_sync_failure like 'Verification failed with: Checksum does not match%'")
          .update_all(retry_at: 1.day.ago)
      end

      # Counts the registry rows with a checksum mismatch failure.
      def total_affected(replicator_class:)
        replicator_class
          .registry_class
          .where("last_sync_failure like 'Verification failed with: Checksum does not match%'")
          .pluck(replicator_class.registry_class::MODEL_FOREIGN_KEY)
          .count
      end
- On the primary site, copy and paste this into Rails console (replace Geo::SnippetRepositoryReplicator with your desired replicator):

      replicator_class = Geo::SnippetRepositoryReplicator

      def reverify_by_id(replicator_class:, ids: [])
        replicator_class.model.primary_key_in(ids).find_each { |p| puts p.replicator.verify }
      end
- On the secondary site, run this to get an idea of how many you will need to do:

      total_affected(replicator_class: replicator_class)
- On the secondary site:

      output_affected_model_ids(replicator_class: replicator_class, limit: 100, offset: 0)

  This will output a list of IDs, like:

      > output_affected_model_ids(replicator_class: replicator_class, limit: 100, offset: 0)
      => "1234,5678,9012"

  If there are more than 100 affected, and you already ran this once and want to get another 100 IDs, then replace the offset 0 with 100. You can also try a higher limit.
- On the primary site (copy and paste the actual comma-separated IDs that were output on the secondary site):

      reverify_by_id(replicator_class: Geo::SnippetRepositoryReplicator, ids: [1234,5678,9012])
- If you have more than 100 affected rows, repeat the previous two steps (get a batch of IDs on the secondary site, then reverify them on the primary site). Increase the offset by the limit you used to get the next batch of IDs. For example, after the first time:

      output_affected_model_ids(replicator_class: replicator_class, limit: 100, offset: 100)

  And after the second time:

      output_affected_model_ids(replicator_class: replicator_class, limit: 100, offset: 200)
- Once no more IDs are returned, resync all affected rows on the secondary site (replace with the actual replicator_class):

      resync_affected(replicator_class: replicator_class)
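The batching in the steps above can be planned ahead with a small sketch in plain Ruby. `batch_offsets` and `parse_ids` are hypothetical helper names, not part of GitLab, and the sketch assumes the set of affected rows does not change between batches.

```ruby
# Hypothetical helpers (not part of GitLab) for planning the batches above.

# Returns the offsets to pass to output_affected_model_ids, given the
# number reported by total_affected and the batch size (limit).
def batch_offsets(total:, limit: 100)
  raise ArgumentError, 'limit must be positive' unless limit.positive?

  (0...total).step(limit).to_a
end

# Converts the comma-separated string printed on the secondary site
# into an array of integers suitable for the ids: argument.
def parse_ids(csv)
  csv.split(',').map { |id| Integer(id.strip) }
end

batch_offsets(total: 250, limit: 100) # => [0, 100, 200]
parse_ids('1234,5678,9012')           # => [1234, 5678, 9012]
```

If `parse_ids` is also pasted into the primary site's console, the string from the secondary could be passed as `ids: parse_ids("1234,5678,9012")` rather than retyped as an array.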
If you have a lot of sync failures
You can make the primary re-checksum all project repos.
The following should cause Geo to re-checksum all projects on the primary:
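- SSH into a Rails node in the primary Geo site.
- Start a Rails console:

      gitlab-rails console

- Run:

      # Reset every row to verification_state 0 (pending) so the primary re-checksums it.
      Geo::ProjectRepositoryReplicator.model.verification_state_table_class.each_batch { |relation| relation.update_all(verification_state: 0) }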
If the load is too high on the primary, reduce "Verification concurrency limit" in Admin > Geo > Sites > Edit (the Primary site). The default of 100 divided by all data types allows up to 5 project repo verification jobs to run concurrently.
You can replace Geo::ProjectRepositoryReplicator with any Replicator.
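As a rough illustration of how lowering "Verification concurrency limit" affects project repo verification, here is a sketch of the arithmetic in plain Ruby. It assumes the limit is split evenly across data types using integer division, and it infers a count of 20 data types from "100 allows up to 5". Neither assumption is taken from GitLab's source code, and `jobs_per_data_type` is a hypothetical name.

```ruby
# Hypothetical sketch of the arithmetic only. The even split and the
# data type count of 20 (inferred from "100 allows up to 5") are assumptions.
ASSUMED_DATA_TYPE_COUNT = 20

def jobs_per_data_type(concurrency_limit, data_type_count: ASSUMED_DATA_TYPE_COUNT)
  concurrency_limit / data_type_count
end

jobs_per_data_type(100) # => 5, matching the default described above
jobs_per_data_type(40)  # => 2
```

Under these assumptions, reducing the limit from 100 to 40 would cut concurrent project repo verification jobs from 5 to 2.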