Geo: Update documentation for data inconsistency issues exposed when migrating artifacts to object storage.
Summary
If some local objects were not migrated to object storage, Geo still tries to verify/replicate them even with object storage configured.
Steps to reproduce
- Have a working Geo instance with all the objects stored locally.
- Configure object storage for the primary site. I am using artifacts as an example object type below, but I assume the same applies to other object types.
- Before migrating artifacts to object storage, introduce some inconsistency: for instance, delete several local artifact files on the primary (see the sketch after these steps). Missing local files are a fairly common scenario for our customers.
- Migrate the artifacts to object storage. You will get some errors about non-existent files, but overall the migration should succeed.
- Configure a secondary Geo site with the same S3 buckets. This option is allowed according to the Geo with Object storage documentation, which permits "The same storage bucket as the primary site." Make sure that the checkbox Allow this secondary node to replicate content on Object Storage is NOT selected. In this case, Geo won't try to replicate anything.
- Reset the verification status on the primary using a snippet like the one below:
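# Batch over the job artifact verification state table and set every row back to 0 (pending)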
Ci::JobArtifact.verification_state_table_class.each_batch do |relation|
  relation.update_all(verification_state: 0)
end
- Wait for verification to be completed.
- On the secondary, click Resync All / Reverify all in the UI.
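For the "introduce some inconsistency" step above, this is a minimal sketch for the primary's Rails console. It assumes locally stored artifacts can be selected via the file_store column and ObjectStorage::Store::LOCAL, and that the uploader exposes the on-disk path via file.path; it deletes files, so use it on a test instance only.

# Remove the on-disk files of a few locally stored artifacts, leaving their
# database rows behind, to simulate missing local files.
Ci::JobArtifact.where(file_store: ::ObjectStorage::Store::LOCAL).limit(3).each do |artifact|
  path = artifact.file.path        # local path from the CarrierWave uploader
  FileUtils.rm_f(path) if path     # delete only the file; the database record stays
end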
What is the current bug behavior?
The secondary will try to replicate the non-existent local artifacts from the primary instance. Even though object storage is configured, Geo still treats any local objects as objects that should be replicated.
What is the expected correct behavior?
The secondary should not try to replicate any local files if object storage is configured.
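For context, the reproduction above relies on object storage replication being disabled on the secondary. Assuming the Allow this secondary node to replicate content on Object Storage checkbox maps to the sync_object_storage attribute of the GeoNode record, this can be confirmed from the secondary's Rails console:

Gitlab::Geo.current_node.sync_object_storage
# => false when the checkbox is not selected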
Relevant logs and/or screenshots
- In my case, I removed 18 local artifacts out of more than 100 in total. Because verification / replication of objects is not done by the secondary, I expected it to show zeroes for artifacts, but it does not:
GitLab Version: 17.3.5-ee
Geo Role: Secondary
Health Status: Healthy
Lfs Objects: succeeded 0 / total 0 (0%)
Merge Request Diffs: succeeded 0 / total 0 (0%)
Package Files: succeeded 0 / total 0 (0%)
Terraform State Versions: succeeded 0 / total 0 (0%)
Snippet Repositories: succeeded 2 / total 2 (100%)
Group Wiki Repositories: succeeded 5 / total 5 (100%)
Pipeline Artifacts: succeeded 0 / total 0 (0%)
Pages Deployments: succeeded 0 / total 0 (0%)
Uploads: succeeded 0 / total 0 (0%)
Job Artifacts: failed 18 / succeeded 0 / total 18 (0%)
Ci Secure Files: succeeded 0 / total 0 (0%)
Dependency Proxy Blobs: succeeded 0 / total 0 (0%)
Dependency Proxy Manifests: succeeded 0 / total 0 (0%)
Project Wiki Repositories: succeeded 47 / total 47 (100%)
Design Management Repositories: succeeded 2 / total 3 (66%)
Project Repositories: succeeded 49 / total 49 (100%)
Lfs Objects Verified: succeeded 0 / total 0 (0%)
Merge Request Diffs Verified: succeeded 0 / total 0 (0%)
Package Files Verified: succeeded 0 / total 0 (0%)
Terraform State Versions Verified: succeeded 0 / total 0 (0%)
Snippet Repositories Verified: succeeded 2 / total 2 (100%)
Group Wiki Repositories Verified: succeeded 5 / total 5 (100%)
Pipeline Artifacts Verified: succeeded 0 / total 0 (0%)
Pages Deployments Verified: succeeded 0 / total 0 (0%)
Uploads Verified: succeeded 0 / total 0 (0%)
Job Artifacts Verified: succeeded 0 / total 18 (0%)
Ci Secure Files Verified: succeeded 0 / total 0 (0%)
Dependency Proxy Blobs Verified: succeeded 0 / total 0 (0%)
Dependency Proxy Manifests Verified: succeeded 0 / total 0 (0%)
Project Wiki Repositories Verified: succeeded 47 / total 47 (100%)
Design Management Repositories Verified: succeeded 2 / total 3 (66%)
Project Repositories Verified: succeeded 49 / total 49 (100%)
Sync Settings: Full
Database replication lag: 0 seconds
Last event ID seen from primary: 165310 (5 minutes ago)
Last event ID processed: 165310 (5 minutes ago)
Last status report was: 1 minute ago
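To dig into the "Job Artifacts: failed 18 / succeeded 0 / total 18" line above, the failed registry rows can be listed on the secondary. A minimal sketch, assuming the Geo::JobArtifactRegistry model with a failed scope and a last_sync_failure column (adjust the names if your version differs):

Geo::JobArtifactRegistry.failed.limit(5).each do |registry|
  puts "artifact ##{registry.artifact_id}: #{registry.last_sync_failure}"
end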
Additional information
As discussed with Sampath internally, I am including some additional information. If artifacts were not fully migrated to object storage, the replication status looks interesting:
- Say, on my instance I have 415 artifacts in total, and 11 of them were not migrated to object storage for whatever reason (see the sketch after this list).
- In the primary section, I see checksum progress for my artifacts: there are 415 of them, all green.
- In the secondary section, I see successful verification and replication for the 11 local artifacts.
- So, Geo still replicates local artifacts when not all of them were migrated to object storage.
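For reference, the 415 / 11 split above can be checked from the primary's Rails console. A minimal sketch, assuming the file_store column distinguishes local from remote storage as in ObjectStorage::Store:

total  = Ci::JobArtifact.count
local  = Ci::JobArtifact.where(file_store: ::ObjectStorage::Store::LOCAL).count
remote = Ci::JobArtifact.where(file_store: ::ObjectStorage::Store::REMOTE).count
puts "total: #{total}, still local: #{local}, in object storage: #{remote}"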
This issue is partially related to a proposal I've created about improving Geo UI: #504676
Implementation Details
- Update the documentation with information about how to properly dispose of local files and the related database records, using the destroy method. Geo with object storage and Object Storage's Troubleshooting section are good candidates to be updated. You can find the destroy method in all model classes that represent the data you want to delete. For example:

Ci::JobArtifact.find(foo).destroy

This should delete the file, cascade to the states table (in this case, the related rows in ci_job_artifacts_states), and create a Geo event saying the data was deleted, which makes all secondary sites delete the relevant files and database records.
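To illustrate what the updated documentation could recommend, here is a hypothetical cleanup sketch for the primary's Rails console. It assumes the orphaned records are artifacts still marked as stored locally whose file no longer exists on disk, and that file_store and file.path behave as described above; it destroys data, so it should be reviewed and tested before running it on a production instance.

Ci::JobArtifact.where(file_store: ::ObjectStorage::Store::LOCAL).find_each do |artifact|
  next if artifact.file.path && File.exist?(artifact.file.path)

  # destroy removes the database row, cascades to ci_job_artifacts_states,
  # and emits a Geo deletion event that the secondaries act on
  artifact.destroy
end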