[Geo] Migration of files to object storage is not forcing re-verification on Geo primary
Summary
Migration of files to object storage does not force re-verification on the Geo primary, resulting in checksum mismatches between the primary and the secondary.
Steps to reproduce
- Set up Geo with 2 single-node sites.
- Generate some artifacts.
- Make sure that verification and replication work fine.
- Configure object storage on both nodes for at least one type of object. Let's take artifacts as an example.
- Migrate artifacts to the object storage on primary.
- Enable GitLab-managed object storage replication on the secondary.
- Observe the verification status on the secondary.
What is the current bug behavior?
The secondary node shows verification errors in the UI. Its geo.log contains "Downloaded file checksum mismatch" errors like this:
{"severity":"INFO","time":"2024-10-08T11:53:05.312Z","correlation_id":"68765a818bb0b585f6cea95fb4a05dd3","class":"Geo::BlobDownloadService",
"gitlab_host":"geo-secondary.gitlab.tld","message":"Blob download","replicable_name":"job_artifact","model_record_id":511,"mark_as_synced":false,"download_success":false,
"bytes_downloaded":24417,"primary_missing_file":false,"download_time_s":1.38,
"reason":"Downloaded file checksum mismatch","primary_checksum":"6d20ef9c3a68114bb1f80b20c75c627c669b08ed7eb2c8be1abfb62908ab6ac8","actual_checksum":"024417"}
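To find the affected records, the log can be filtered for this reason. The following is a minimal sketch, assuming each geo.log entry is a single JSON line (the entry above is wrapped for readability); the helper name checksum_mismatches is hypothetical, and the log path in the usage comment is the usual Linux package location, which may differ on your installation.

```ruby
require 'json'

MISMATCH_REASON = 'Downloaded file checksum mismatch'

# Hypothetical helper: returns the identifying fields of every log entry
# whose "reason" is a checksum mismatch. Non-JSON lines are skipped.
def checksum_mismatches(log_lines)
  log_lines.filter_map do |line|
    entry = JSON.parse(line)
    next unless entry['reason'] == MISMATCH_REASON

    entry.slice('replicable_name', 'model_record_id',
                'primary_checksum', 'actual_checksum')
  rescue JSON::ParserError
    nil
  end
end

# Usage (path is an assumption):
# checksum_mismatches(File.foreach('/var/log/gitlab/gitlab-rails/geo.log'))
```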
What is the expected correct behavior?
Verification should be completed successfully. Objects migrated from local to object storage should be scheduled for re-verification after the migration is complete.
Additional details
The issue was reproduced with GitLab 17.1.4.
The error happens because verification ends up comparing the checksum of a local file with the checksum of a remote file. According to this code, the checksum is calculated differently for the two: for local files it is a SHA digest, for remote files it is the file size.
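The mismatch can be illustrated with a small standalone sketch. It is not the GitLab implementation: checksum_for is a hypothetical helper, and SHA-256 is an assumption based on the 64-character primary_checksum in the log above.

```ruby
require 'digest'

# Hypothetical helper mirroring the behaviour described above:
# local files get a SHA-256 digest, remote files get their size as a string.
def checksum_for(content, remote:)
  remote ? content.bytesize.to_s : Digest::SHA256.hexdigest(content)
end

content = 'artifact bytes'

# Checksum stored on the primary while the file was still local.
stale_primary_checksum = checksum_for(content, remote: false)

# Checksum computed once the same file lives in object storage.
checksum_after_migration = checksum_for(content, remote: true)

# The two values can never match, even though the content is identical.
mismatch = stale_primary_checksum != checksum_after_migration
```

Until the primary recalculates its checksum for the migrated file, the stored value is a digest while the freshly computed one is a size, so the comparison fails regardless of the file's content.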
Workarounds
To work around the issue, run this snippet on the primary site: $4846478.
Other workarounds
We can manually reset the verification status on the primary, following the "Reverify all components" documentation:
Ci::JobArtifact.verification_state_table_class.each_batch do |relation|
  relation.update_all(verification_state: 0)
end
After that, we need to wait for the re-verification to complete, then observe that verification is successful on the secondary:
{"severity":"INFO","time":"2024-10-08T12:38:12.899Z","correlation_id":"5f0ea6e0a41ab9dd68e68469f9af4b12",
"class":"Geo::BlobDownloadService","gitlab_host":"geo-secondary.gitlab.tld","message":"Blob download","replicable_name":"job_artifact",
"model_record_id":68,"mark_as_synced":true,"download_success":true,"bytes_downloaded":1859,"primary_missing_file":false,
"download_time_s":1.742,"reason":null}
Another option is to decrease the re-verification interval for the primary, which seems to be 7 days by default.
Implementation Guide
See comments in this issue:
maybe hook into def migrate! or maybe look into these with_callbacks.
Are you thinking something like this?
We don't need to trigger an event because this is on the primary site. This happens because we do not mark the verification as pending when migration is done. Marking it as pending will let the verification backfill worker pick it up. After the new checksum is calculated on the primary, we trigger an event to reset it on the secondary.
- Extend class GitlabUploader with an EE module, see https://docs.gitlab.com/development/ee_features/#extend-ce-features-with-ee-backend-code. I think it will be in ee/app/uploaders/ee/gitlab_uploader.rb.
- In the new EE module, define after :migrate, :recalculate_checksum_for_geo
- Define def recalculate_checksum_for_geo and make it mark the primary site's verification state as pending for that resource. Something like model.verification_pending! and guard it with return unless Gitlab::Geo.primary?
- Add unit tests in ee/ for GitlabUploader, for describe '#migrate!' do and describe '#recalculate_checksum_for_geo!' do. Search the codebase for #migrate! for relevant similar examples.
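The steps above could look roughly like the following sketch. It runs standalone, so everything outside the EE module is a simplified stand-in and not the real GitLab code: the Gitlab::Geo stub with its primary_flag accessor, FakeModel, and the minimal GitlabUploader with its own after/migrate! implementation are all assumptions made for illustration. Only the shape of EE::GitlabUploader reflects the proposal.

```ruby
# Stand-in for Gitlab::Geo so the sketch runs outside GitLab.
module Gitlab
  module Geo
    class << self
      attr_accessor :primary_flag # hypothetical toggle, not a real API

      def primary?
        !!primary_flag
      end
    end
  end
end

# Stand-in for a verifiable model such as a job artifact.
class FakeModel
  attr_reader :verification_state

  def initialize
    @verification_state = :succeeded
  end

  def verification_pending!
    @verification_state = :pending
  end
end

# Simplified stand-in for the uploader with a tiny callback registry.
class GitlabUploader
  attr_reader :model

  def self.after_migrate_callbacks
    @after_migrate_callbacks ||= []
  end

  def self.after(event, method_name)
    after_migrate_callbacks << method_name if event == :migrate
  end

  def initialize(model)
    @model = model
  end

  def migrate!(new_store)
    @object_store = new_store
    self.class.after_migrate_callbacks.each { |name| send(name) }
  end
end

# The proposed EE extension: mark verification as pending after migration.
module EE
  module GitlabUploader
    def self.prepended(base)
      base.after :migrate, :recalculate_checksum_for_geo
    end

    private

    def recalculate_checksum_for_geo
      return unless ::Gitlab::Geo.primary?

      model.verification_pending!
    end
  end
end

GitlabUploader.prepend(EE::GitlabUploader)
```

On a primary, calling migrate! leaves the model in the pending state so that the verification backfill worker can pick it up; on a secondary, the guard returns early and the state is untouched.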