[Geo] Migration of files to object storage is not forcing re-verification on Geo primary

Summary

Migrating files to object storage does not force re-verification on the Geo primary, which results in checksum mismatches between the primary and the secondary.

Steps to reproduce

  • Set up Geo with 2 single-node sites.
  • Generate some artifacts.
  • Make sure that verification and replication work correctly.
  • Configure object storage on both nodes for at least one type of object. Let's take artifacts as an example.
  • Migrate artifacts to object storage on the primary.
  • Enable GitLab-managed object storage replication on the secondary.
  • Observe the verification status on the secondary.

What is the current bug behavior?

The secondary node shows verification errors in the UI. In geo.log, it logs entries with the reason "Downloaded file checksum mismatch", like this:

{"severity":"INFO","time":"2024-10-08T11:53:05.312Z","correlation_id":"68765a818bb0b585f6cea95fb4a05dd3","class":"Geo::BlobDownloadService",
"gitlab_host":"geo-secondary.gitlab.tld","message":"Blob download","replicable_name":"job_artifact","model_record_id":511,"mark_as_synced":false,"download_success":false,
"bytes_downloaded":24417,"primary_missing_file":false,"download_time_s":1.38,
"reason":"Downloaded file checksum mismatch","primary_checksum":"6d20ef9c3a68114bb1f80b20c75c627c669b08ed7eb2c8be1abfb62908ab6ac8","actual_checksum":"024417"}

What is the expected correct behavior?

Verification should be completed successfully. Objects migrated from local to object storage should be scheduled for re-verification after the migration is complete.

Additional details

The issue was reproduced with GitLab 17.1.4.

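The error happens because a checksum calculated for a local file is compared with one calculated for a remote file. According to this code, the checksum is calculated differently for local and remote files: for local files it is a SHA digest, for remote files it is the file size. In the log entry above, primary_checksum is still the SHA digest from before the migration, while actual_checksum ("024417") corresponds to bytes_downloaded (24417).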
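To make the mismatch concrete, here is a minimal sketch of the two checksum styles described above. The helper names are hypothetical and are not GitLab methods. The sketch assumes SHA-256 for the local case and a plain decimal size for the remote case, and it does not model any padding or formatting that GitLab applies to the size.

```ruby
require 'digest'

# Hypothetical helpers for illustration only; these are not GitLab methods.

# Checksum style for a file on local disk: a SHA-256 hex digest of the content.
def local_style_checksum(content)
  Digest::SHA256.hexdigest(content)
end

# Checksum style for a file in object storage: derived from the size only.
# Any padding or formatting GitLab applies to the size is not modeled here.
def remote_style_checksum(content)
  content.bytesize.to_s
end

content = 'example artifact content'

# Calculated while the file was local and never refreshed after the migration.
stored_on_primary = local_style_checksum(content)

# Calculated once the same file lives in object storage.
seen_by_secondary = remote_style_checksum(content)

# Same bytes, but the values cannot match: a SHA-256 hex digest has
# 64 characters, while the size here is a short decimal string.
puts stored_on_primary == seen_by_secondary
```

Running this prints false, which mirrors the situation in the log: the content is identical on both sites, but the stored checksum and the freshly calculated one were produced by different methods.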

Workarounds

To work around the issue, run this snippet on the primary site: $4846478.

Other workarounds

We can reset the verification status on the primary manually, as described in the doc Reverify all components:

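# Reset the verification state of all job artifacts in batches,
# so that the primary re-verifies them (0 = verification pending).
Ci::JobArtifact.verification_state_table_class.each_batch do |relation|
  relation.update_all(verification_state: 0)
end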

After that, we need to wait for the re-verification to complete, then observe that verification is successful on the secondary:

{"severity":"INFO","time":"2024-10-08T12:38:12.899Z","correlation_id":"5f0ea6e0a41ab9dd68e68469f9af4b12",
"class":"Geo::BlobDownloadService","gitlab_host":"geo-secondary.gitlab.tld","message":"Blob download","replicable_name":"job_artifact",
"model_record_id":68,"mark_as_synced":true,"download_success":true,"bytes_downloaded":1859,"primary_missing_file":false,
"download_time_s":1.742,"reason":null}

Another option is to decrease the re-verification interval for the primary, which seems to be 7 days by default.

Implementation Guide

See comments in this issue:

Maybe hook into def migrate!, or look into these with_callbacks. Are you thinking something like this?

We don't need to trigger an event because this is on the primary site. This happens because we do not mark the verification as pending when migration is done. Marking it as pending will let the verification backfill worker pick it up. After the new checksum is calculated on the primary, we trigger an event to reset it on the secondary.

  • Extend class GitlabUploader with an EE module, see https://docs.gitlab.com/development/ee_features/#extend-ce-features-with-ee-backend-code. I think it will be in ee/app/uploaders/ee/gitlab_uploader.rb.
  • In the new EE module, define after :migrate, :recalculate_checksum_for_geo.
  • Define def recalculate_checksum_for_geo and make it mark the primary site's verification state as pending for that resource, with something like model.verification_pending!. Guard it with return unless Gitlab::Geo.primary?
  • Add unit tests in ee/ for GitlabUploader, with describe '#migrate!' do and describe '#recalculate_checksum_for_geo' do. Search the codebase for #migrate! to find similar examples.
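The steps above could look roughly like the following sketch. It is self-contained so that it runs outside GitLab: Gitlab::Geo, GitlabUploader, the after callback registry, and FakeArtifact are simplified stand-ins written for this example, not the real classes. In the actual codebase the EE module would presumably use the project's usual prepend pattern and the uploader's existing callback mechanism. The respond_to? guard is an extra assumption, added in case an uploader's model is not verifiable.

```ruby
# Stand-ins so the sketch runs on its own; the real GitLab classes differ.
module Gitlab
  module Geo
    class << self
      attr_accessor :primary

      def primary?
        !!primary
      end
    end
  end
end

# Hypothetical model with a verification state.
class FakeArtifact
  attr_reader :verification_state

  def initialize
    @verification_state = :verification_succeeded
  end

  def verification_pending!
    @verification_state = :verification_pending
  end
end

# Simplified uploader with a tiny "after" callback registry.
class GitlabUploader
  attr_reader :model

  def self.after_callbacks
    @after_callbacks ||= Hash.new { |hash, key| hash[key] = [] }
  end

  def self.after(event, method_name)
    after_callbacks[event] << method_name
  end

  def initialize(model)
    @model = model
  end

  def migrate!
    # The actual move to object storage is elided in this sketch.
    self.class.after_callbacks[:migrate].each { |name| send(name) }
  end
end

module EE
  module GitlabUploader
    def self.prepended(base)
      base.after :migrate, :recalculate_checksum_for_geo
    end

    # Mark the record as pending so the verification backfill picks it up.
    def recalculate_checksum_for_geo(*)
      return unless ::Gitlab::Geo.primary?
      # Assumption: skip models that do not support verification.
      return unless model.respond_to?(:verification_pending!)

      model.verification_pending!
    end
  end
end

GitlabUploader.prepend(EE::GitlabUploader)

Gitlab::Geo.primary = true
artifact = FakeArtifact.new
GitlabUploader.new(artifact).migrate!
puts artifact.verification_state
```

On a simulated primary this prints verification_pending. With Gitlab::Geo.primary set to false the state stays untouched, which is the behavior the guard is meant to provide.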
Edited Jun 13, 2025 by 🤖 GitLab Bot 🤖