Geo: invalid LFS object deletion on secondary when managed object replication is disabled
Summary
A customer has configured a primary and two secondary GitLab instances with Geo. They use S3 object storage for LFS objects, but replicate the storage buckets with AWS replication rather than GitLab-managed object replication. What they see is that objects are replicated to the secondary bucket by AWS replication and are then deleted on the secondary by `Geo::DestroyWorker` jobs scheduled from `Geo::Secondary::RegistryConsistencyWorker` (see the Sidekiq log below).
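For context, whether a Geo secondary uses GitLab-managed object replication is controlled per site by the `sync_object_storage` flag on its `GeoNode` record. A minimal sketch of how to confirm the customer's configuration, assuming a Rails console on the secondary site:

```ruby
# Rails console on the secondary site (sketch).
# Assumption: GeoNode#sync_object_storage is the flag behind the
# "replicate content on Object Storage" setting for the site.
Gitlab::Geo.current_node.sync_object_storage
# => false  # AWS bucket replication is used instead of GitLab-managed replication
```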
Sidekiq JSON output
{ "id": "AQAAAYKyjVHpQIyyCQAAAABBWUt5alZwN0FBQWlVR0VrSGVGbGhnQXk", "content": { "timestamp": "2022-08-18T20:03:02.249Z", "tags": [ "filename:current", "service:sidekiq", "source:gitlab", "availability-zone:us-east-1c", "environment:tools", "gitlab_geo_deployment:secondary", "gitlab_geo_full_role:geo_secondary_site_sidekiq_secondary", "gitlab_geo_site:geo_secondary_site", "gitlab_node_level:sidekiq_secondary", "gitlab_node_prefix:gitlab", "gitlab_node_type:sidekiq", "iam_profile:gitlab-ec2-s3-access-profile", "image:REDACTED", "instance-type:m5.2xlarge", "kernel:none", "name:gitlab-tools-ue1-sidekiq-3", "region:us-east-1", "ec2-security-backup:true", "security-group:REDACTED", "terraform:true", "terraformmodule:gitlab-compute-resources:1.0.2", "terragruntmodule:tools/us-east-1/gitlab-onprem/gitlab-sidekiq-ec2" ], "host": "gitlab-tools-ue1-sidekiq-3", "service": "sidekiq", "message": "Geo::DestroyWorker JID-7e96c53d57113e8071439d20: done: 0.386539 sec", "attributes": { "service_name": "sidekiq", "pid": 161340, "dc": { "correlation_id": "c3609ea7af439c1ead4ed4bf9a45fa40" }, "db_main_wal_cached_count": 0, "redis_duration_s": 0.003785, "db_write_count": 3, "version": 0, "redis_queues_read_bytes": 10, "db_main_duration_s": 0.002, "mem_bytes": 4366512, "db_main_replica_wal_cached_count": 0, "enqueued_at": "2022-08-18T20:03:01.861Z", "redis_shared_state_read_bytes": 2, "db_primary_wal_count": 0, "job_size_bytes": 21, "correlation_id": "c3609ea7af439c1ead4ed4bf9a45fa40", "meta": { "client_id": "ip/", "feature_category": "geo_replication", "caller_id": "Geo::Secondary::RegistryConsistencyWorker" }, "redis_read_bytes": 418, "redis_cache_duration_s": 0.001441, "mem_objects": 35863, "idempotency_key": "resque:gitlab:duplicate:geo:geo_destroy:c526d80de74aded0756b9e716e855aeaaf47de5aa70025c2a4b5b8a199b8c919", "db_main_wal_count": 0, "db_primary_cached_count": 0, "job_status": "done", "queue": "geo:geo_destroy", "db_replica_wal_count": 0, "retry": 3, "redis_calls": 5, "severity": "INFO", "args": [ "lfs_object", "108934" ], "worker_data_consistency": "always", "db_replica_duration_s": 0, "redis_queues_duration_s": 0.000404, "db_primary_duration_s": 0.007, "completed_at": "2022-08-18T20:03:02.249Z", "class": "Geo::DestroyWorker", "redis_queues_calls": 1, "db_main_replica_wal_count": 0, "db_primary_count": 5, "db_main_replica_duration_s": 0, "redis_shared_state_write_bytes": 332, "db_duration_s": 0.006393, "scheduling_latency_s": 0.002136, "db_main_cached_count": 0, "db_main_count": 1, "jid": "7e96c53d57113e8071439d20", "external_http_duration_s": 0.30817187190405093, "redis_shared_state_calls": 2, "redis_cache_write_bytes": 129, "db_main_replica_count": 0, "db_replica_count": 0, "db_replica_wal_cached_count": 0, "cpu_s": 0.050792, "load_balancing_strategy": "primary", "size_limiter": "validated", "mem_mallocs": 16594, "external_http_count": 9, "db_cached_count": 0, "redis_shared_state_duration_s": 0.00194, "db_replica_cached_count": 0, "mem_total_bytes": 5801032, "queue_namespace": "geo", "redis_queues_write_bytes": 328, "duration_s": 0.386539, "redis_cache_calls": 2, "time": "2022-08-18T20:03:02.249Z", "redis_write_bytes": 789, "db_count": 5, "redis_cache_read_bytes": 406, "db_main_replica_cached_count": 0, "created_at": "2022-08-18T20:03:01.837Z", "db_primary_wal_cached_count": 0 } } }This seems to be associated with a bug while finding the LFS Object for the current secondary site in https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/models/ee/lfs_object.rb#L43-47 when object storage is enabled, 
and GitLab's managed object replication is disabled. When sync object storage is disabled, we return only LFS object whether the file is stored locally, and this relation will not include the new LFS object that is on object storage. So, the Geo::RegistryConsistenceService triggers the deletion.
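A simplified sketch of the scoping described above (paraphrased, not the exact code in `ee/app/models/ee/lfs_object.rb`; the real method also accounts for selective sync):

```ruby
# Paraphrased sketch of LfsObject.replicables_for_current_secondary.
# Assumption: simplified from ee/app/models/ee/lfs_object.rb for illustration.
def replicables_for_current_secondary(primary_key_in)
  scope = primary_key_in(primary_key_in)

  # When the secondary does not use GitLab-managed object replication, only
  # locally stored files (file_store = 1) are considered replicable. LFS
  # objects on object storage fall outside this relation, so the registry
  # consistency check treats their registry rows as orphans and schedules
  # Geo::DestroyWorker for them.
  scope = scope.with_files_stored_locally unless Gitlab::Geo.current_node.sync_object_storage

  scope
end
```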
Workaround
Use GitLab-managed object replication instead of AWS bucket replication.
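This is toggled per secondary site in the Admin Area Geo settings. As a hedged sketch from a Rails console on the primary, assuming the `GeoNode#sync_object_storage` flag is the setting behind that toggle:

```ruby
# Sketch: enable GitLab-managed object replication for every secondary site.
# Assumption: run on the primary, where GeoNode records are writable, and
# GeoNode#sync_object_storage is the flag behind the Admin Area checkbox.
GeoNode.secondary_nodes.find_each do |node|
  node.update!(sync_object_storage: true)
end
```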
Root cause
From !95937 (comment 1073517003):
The customer is experiencing the bug described in #371397 (closed) while importing Bitbucket projects. This happens because during the LFS objects import phase we:

- Download the file into a temporary folder
- Create an entry in the `lfs_objects` table with `file_store = 1`
- Move the file to the object store
- Update the `file_store` column

Since `LfsObject.replicables_for_current_secondary` returns only LFS objects whose files are stored locally (`file_store = 1`) when GitLab-managed replication is disabled, this relation will include the LFS object created in step 2 above, and `Geo::RegistryConsistencyWorker` will create a registry in the tracking database. The next time `Geo::RegistryConsistencyWorker` runs, it removes the registry rows improperly created due to the workflow described above. We can validate this locally by following steps 1-11 in !95937 (merged), then:
In a Rails console session on your primary site, create an LFS object:

```ruby
> LfsObject.safe_find_or_create_by!(oid: 'f2b0a1e7550e9b718dafc9b525a04879a766de62e4fbdfc46593d47f7ab74642', size: 10.kilobytes)
=> #<LfsObject:0x00000001186819b0
 id: 6,
 oid: "f2b0a1e7550e9b718dafc9b525a04879a766de62e4fbdfc46593d47f7ab74642",
 size: 51200,
 created_at: Mon, 22 Aug 2022 23:32:13.297947000 UTC +00:00,
 updated_at: Mon, 22 Aug 2022 23:32:13.297947000 UTC +00:00,
 file: "a1e7550e9b718dafc9b525a04879a766de62e4fbdfc46593d47f7ab74642",
 file_store: 1,
 verification_checksum: nil>
```
In a Rails console session on your secondary site, check the latest `Geo::LfsObjectRegistry` record:

```ruby
> Geo::LfsObjectRegistry.last
=> #<Geo::LfsObjectRegistry:0x0000000130853000
 id: 6,
 created_at: Mon, 22 Aug 2022 23:32:16.167890000 UTC +00:00,
 retry_at: Mon, 22 Aug 2022 23:33:01.291105000 UTC +00:00,
 bytes: nil,
 lfs_object_id: 6,
 retry_count: 1,
 missing_on_primary: false,
 success: false,
 sha256: nil,
 state: 3,
 last_synced_at: Mon, 22 Aug 2022 23:32:16.166723000 UTC +00:00,
 last_sync_failure: "The file is missing on the Geo primary site",
 verification_started_at: nil,
 verified_at: nil,
 verification_retry_at: nil,
 verification_retry_count: 0,
 verification_state: 0,
 checksum_mismatch: false,
 verification_checksum: nil,
 verification_checksum_mismatched: nil,
 verification_failure: nil>
```
That's why this MR also introduces the same check in `Gitlab::Geo::Replication`, to avoid downloading unnecessary LFS objects.
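As a purely hypothetical sketch of such a check (the method name `downloadable?` and its placement are assumptions, not the actual MR diff), the idea is to skip blobs that fall outside the secondary's replicables scope:

```ruby
# Hypothetical sketch, not the actual change in Gitlab::Geo::Replication:
# skip downloading an LFS object that the current secondary would not consider
# replicable (e.g. it lives on object storage while GitLab-managed object
# replication is disabled).
def downloadable?(lfs_object)
  LfsObject.replicables_for_current_secondary(lfs_object.id).exists?
end
```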
If someone wants to go deeper into the current implementation, follow the method calls in the workflow above (annotated in the sketch below):

- `Projects::ImportService#execute`
- `Projects::ImportService#download_lfs_objects`
- `Projects::LfsPointers::LfsImportService.new(project).execute`
- `Projects::LfsDownloadService.new(project, lfs_download_object).execute`
- `Projects::LfsDownloadService.find_or_create_lfs_object`
- `LfsObject.safe_find_or_create_by!(oid: lfs_oid, size: lfs_size)`
- `lfs_obj.update!(file: tmp_file) unless lfs_obj.file.file`
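A minimal annotated sketch of the last two steps, paraphrasing the shape of `Projects::LfsDownloadService.find_or_create_lfs_object` (illustrative only, not the exact source), shows the window in which the record exists with `file_store = 1`:

```ruby
# Paraphrased sketch of the import step that creates the window described
# above. Assumption: lfs_oid, lfs_size and tmp_file come from the download
# step earlier in the call chain.
def find_or_create_lfs_object
  lfs_obj = LfsObject.safe_find_or_create_by!(oid: lfs_oid, size: lfs_size)

  # The row now exists with file_store = 1 (local), so a registry consistency
  # run at this moment considers it replicable even though GitLab-managed
  # object replication is disabled ...
  lfs_obj.update!(file: tmp_file) unless lfs_obj.file.file
  # ... and once the file is moved to object storage and file_store becomes 2,
  # the next run no longer sees it in the scope and schedules the deletion.

  lfs_obj
end
```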
The bug was introduced in !80007 (merged) and only affects the following scenarios on GitLab instances running version >= 15.0:
- LFS objects created while importing a project with object storage enabled and GitLab-managed replication disabled, due to how we import those files.
- A customer who was using GitLab-managed replication to sync object storage and then disables it. The next time `Geo::RegistryConsistencyWorker` runs, it removes the registry rows and the files on object storage.
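To gauge exposure on an affected secondary, a hedged check (a sketch; exact scopes and signatures depend on the GitLab version) is to compare the LFS objects stored remotely with those the secondary considers replicable:

```ruby
# Rails console on the secondary site (sketch). file_store: 2 is object storage.
# Assumption: replicables_for_current_secondary accepts an ID range, as used by
# the registry consistency worker; guard against an empty table before running.
remote_total = LfsObject.where(file_store: 2).count

id_range = LfsObject.minimum(:id)..LfsObject.maximum(:id)
replicable_remote = LfsObject
  .replicables_for_current_secondary(id_range)
  .where(file_store: 2)
  .count

# With GitLab-managed object replication disabled, replicable_remote is
# expected to be 0, which is why registry rows for remote objects get removed.
[remote_total, replicable_remote]
```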