
Geo: Project repo primary checksum out-of-date

Problem

On staging-ref.gitlab.com, some project repositories had out-of-date checksums, which caused verification to fail persistently on the secondary site even though the secondary had matching data. It also caused wasteful resyncs.

This state eventually self-resolves, at the latest at the re-verification interval, because the primary re-checksums at that time. The interval defaults to 7 days but is user-configurable; I often suggest 30-90 days to reduce resource usage. A push to the project will also trigger a re-checksum on the primary, which resolves the verification failure.
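
To check the interval currently configured for a site, one option is a quick Rails console query on the primary. This is a sketch that assumes the setting is exposed as minimum_reverification_interval on the Geo node record (the "Re-verification interval" field in Admin > Geo > Sites > Edit); confirm the attribute name for your GitLab version:

    # Sketch: read the configured re-verification interval (in days) on the primary.
    # Assumes GeoNode exposes `minimum_reverification_interval`.
    node = Gitlab::Geo.current_node
    puts "Re-verification interval: #{node.minimum_reverification_interval} days"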

A customer experienced this as well: #426778 (comment 1588782813)

Possible fixes

We do not yet know how these registry rows got into this state.

  • Maybe we missed triggering checksumming somewhere that repos get updated.
  • Maybe there is a race condition.
  • Maybe it's related to some other recent bug.

Workaround

Workaround: $4846478

Old workarounds

NOTE: The following workaround does not resolve all failures but has been effective for some customers.

If you don't have many of these sync failures

You can recalculate the checksum only for those affected.

  1. On the secondary site, run this to see what Replicators are available:

    ::Gitlab::Geo::REPLICATOR_CLASSES
  2. On the secondary site, copy and paste this into a Rails console (replace Geo::SnippetRepositoryReplicator with your desired replicator):

    replicator_class = Geo::SnippetRepositoryReplicator
    
    # Returns a comma-separated string of model IDs whose last sync failure
    # was a checksum mismatch, in batches of `limit` starting at `offset`.
    def output_affected_model_ids(replicator_class:, limit: 100, offset: 0)
      replicator_class
        .registry_class
        .where("last_sync_failure like 'Verification failed with: Checksum does not match%'")
        .limit(limit)
        .offset(offset)
        .pluck(replicator_class.registry_class::MODEL_FOREIGN_KEY)
        .join(',')
    end
    
    # Marks all affected registry rows as due for retry, so the secondary
    # resyncs them on its next scheduling pass.
    def resync_affected(replicator_class:)
      replicator_class
        .registry_class
        .failed
        .where("last_sync_failure like 'Verification failed with: Checksum does not match%'")
        .update_all(retry_at: 1.day.ago)
    end
    
    # Counts how many registry rows are affected by the checksum mismatch failure.
    def total_affected(replicator_class:)
      replicator_class
        .registry_class
        .where("last_sync_failure like 'Verification failed with: Checksum does not match%'")
        .count
    end
  3. On the primary site, copy and paste this into a Rails console (replace Geo::SnippetRepositoryReplicator with your desired replicator):

    replicator_class = Geo::SnippetRepositoryReplicator
    
    # Asks the primary to recalculate the checksum for the given model IDs
    # and prints the result of each verification.
    def reverify_by_id(replicator_class:, ids: [])
      replicator_class.model.primary_key_in(ids).find_each { |p| puts p.replicator.verify }
    end
  4. On the secondary site, run this to get an idea of how many affected rows you will need to process:

    total_affected(replicator_class: replicator_class)
  5. On the secondary site:

    output_affected_model_ids(replicator_class: replicator_class, limit: 100, offset: 0)

    This will output a list of IDs, like:

    > output_affected_model_ids(replicator_class: replicator_class, limit: 100, offset: 0)
    => "1234,5678,9012"

    If there are more than 100 affected, and you already ran this once and want to get another 100 IDs, then replace 0 with 100. You can also try a higher limit.

  6. On the primary site (copy and paste the actual comma-separated IDs that were output on the secondary site):

    reverify_by_id(replicator_class: Geo::SnippetRepositoryReplicator, ids: [1234,5678,9012])
  7. If you have more than 100 affected rows, then repeat steps 5 and 6, increasing the offset by the limit you used to get the next batch of IDs. For example, after the first time:

    output_affected_model_ids(replicator_class: replicator_class, limit: 100, offset: 100)

    And after the second time:

    output_affected_model_ids(replicator_class: replicator_class, limit: 100, offset: 200)
  8. After you've repeated steps 5 and 6 until no more IDs are returned, resync all affected rows on the secondary site (using the same replicator_class as before):

    resync_affected(replicator_class: replicator_class)
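
Once the scheduled resyncs have run on the secondary, you can reuse the helper from step 2 to confirm that the number of affected registry rows is dropping:

    # Re-run the count from step 2; it should trend toward zero as the
    # resyncs and re-verifications complete on the secondary.
    total_affected(replicator_class: replicator_class)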

If you have a lot of sync failures

You can make the primary re-checksum all project repositories:

  1. SSH into a Rails node in the primary Geo site
  2. gitlab-rails console
  3. Geo::ProjectRepositoryReplicator.model.verification_state_table_class.each_batch { |relation| relation.update_all(verification_state: 0) } (this resets every row's verification state to pending, so the primary re-checksums each repository)

If the load on the primary is too high, reduce "Verification concurrency limit" in Admin > Geo > Sites > Edit (on the primary site). The default of 100 is divided across all data types, which allows up to 5 project repository verification jobs to run concurrently.
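
If you prefer the Rails console to the Admin Area, the following sketch shows one way to inspect and lower the limit on the primary. It assumes the UI field maps to the verification_max_capacity attribute on the Geo node record; confirm the attribute name for your version before changing it:

    # Sketch: inspect and reduce the verification concurrency limit on the primary.
    # Assumes "Verification concurrency limit" maps to `verification_max_capacity`.
    node = Gitlab::Geo.current_node
    puts "Current verification concurrency limit: #{node.verification_max_capacity}"
    node.update!(verification_max_capacity: 50) # lower to reduce load on the primary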

You can replace Geo::ProjectRepositoryReplicator with any Replicator.
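
To watch re-verification progress on the primary, one option is to tally rows by verification state. This sketch assumes the conventional state values (0 = pending, 1 = started, 2 = succeeded, 3 = failed); check the enum in your version:

    # Counts project repositories in each verification state on the primary.
    Geo::ProjectRepositoryReplicator
      .model
      .verification_state_table_class
      .group(:verification_state)
      .count
    # Example shape of the output: {0 => 1200, 2 => 84000, 3 => 12}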
