
Investigate the Geo error: Checksum does not match the primary checksum on ContainerRepository records


Summary

We have seen at least two instances where re-verifying the objects on the primary did not resolve the error Checksum does not match the primary checksum. Here is what we have tried so far:

### Primary ###
# Re-verify every container repository so the primary recomputes its checksum.
ContainerRepository.in_batches(of: 100) do |batch|
  batch.each do |c|
    event = c.replicator.verify
    puts "Container Repository ID #{c.id}: #{event.event_name}"
  end
end

### Secondary ###
# Re-trigger a sync for every registry entry currently in the failed state.
Geo::ContainerRepositoryRegistry.failed.find_each do |registry|
  begin
    registry.replicator.sync
    puts "Sync initiated for registry ID: #{registry.id}"
  rescue => e
    puts "ID: #{registry.id}, Project ID: #{registry.project_id}, Failed: '#{e}'", e.backtrace.join("\n")
  end
end; nil

We noticed that re-verifying the objects on the primary yields the same checksum, but the sync process fails midway, and the checksum of the partially synced data on the secondary differs (as expected) during subsequent verification. We also tried deleting the failed container repository data from the disk on the secondary to remove the metadata, but that didn’t help either.
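To see which checksum each side recorded, the values can be compared directly from the Rails console on the secondary. The snippet below is only a sketch: the association and the verification_checksum attributes on the registry and on the ContainerRepository model are assumptions based on the Geo verification framework, not a verified API.

### Secondary (sketch, attribute names are assumptions) ###
# Compare the checksum computed on the secondary (registry) with the one
# replicated from the primary (model record) for each failed registry entry.
Geo::ContainerRepositoryRegistry.failed.find_each do |registry|
  repo = registry.container_repository
  puts "Registry #{registry.id}: secondary checksum #{registry.verification_checksum.inspect}, " \
       "primary checksum #{repo.verification_checksum.inspect}"
end; nil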

Container repositories include tags, each of which has a digest attribute, and the repository checksum is calculated by combining these digests. In some cases this attribute is nil on the secondary, and the checksum mismatch occurs.
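As a rough illustration of why a missing digest matters, the standalone sketch below combines tag digests into a single checksum. This is a simplified model, not the actual GitLab implementation: a single nil digest on the secondary is enough to change the result even when the tag list otherwise matches.

require 'digest'

# Simplified model of a checksum derived from tag digests (illustrative only).
def combined_checksum(tag_digests)
  Digest::SHA256.hexdigest(tag_digests.map(&:to_s).sort.join)
end

primary_digests   = ["sha256:aaa", "sha256:bbb"]
secondary_digests = ["sha256:aaa", nil] # digest missing on the secondary

combined_checksum(primary_digests) == combined_checksum(secondary_digests)
# => false, i.e. "Checksum does not match the primary checksum"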

This issue is for collecting all cases where re-verification of failed objects on the primary did not resolve the problem, as well as for investigating this error message in detail and providing a fix or updating the documentation as needed. The data issue with the tags should be part of the investigation, as it might be at least part of the root cause of the mismatches.

Steps to reproduce

Not available.

Example Project

What is the current bug behavior?

Replication of some objects fails with Checksum does not match the primary checksum.

What is the expected correct behavior?

Replication should work as expected.

Relevant logs and/or screenshots

See the following RFH issues:

Output of checks

The issue was reproducible from GitLab 17.11 through 18.2.

Results of GitLab environment info

Expand for output related to GitLab environment info

(For installations with the omnibus-gitlab package, run and paste the output of: `sudo gitlab-rake gitlab:env:info`)
(For installations from source, run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)

Results of GitLab application Check

Expand for output related to the GitLab application check

(For installations with the omnibus-gitlab package, run and paste the output of: `sudo gitlab-rake gitlab:check SANITIZE=true`)
(For installations from source, run and paste the output of: `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)
(we will only investigate if the tests are passing)

Possible fixes

Patch release information for backports

If the bug fix needs to be backported in a patch release to a version under the maintenance policy, please follow the steps on the patch release runbook for GitLab engineers.

Refer to the internal "Release Information" dashboard for information about the next patch release, including the targeted versions, expected release date, and current status.

High-severity bug remediation

To remediate high-severity issues requiring an internal release for single-tenant SaaS instances, refer to the internal release process for engineers.
