Geo: invalid LFS object deletion on secondary when managed object replication is disabled
  Closed · Issue created by Diana Stanley

    Summary

    Customer has configured a primary and two secondary GitLab instances with Geo. They use S3 object storage for LFS objects, but currently rely on AWS replication of the storage buckets rather than GitLab-managed replication of object storage. Objects are replicated to the secondary bucket by AWS replication and are then deleted by GitLab's Geo::RegistryConsistencyWorker on the secondary host.

    Sidekiq json output { "id": "AQAAAYKyjVHpQIyyCQAAAABBWUt5alZwN0FBQWlVR0VrSGVGbGhnQXk", "content": { "timestamp": "2022-08-18T20:03:02.249Z", "tags": [ "filename:current", "service:sidekiq", "source:gitlab", "availability-zone:us-east-1c", "environment:tools", "gitlab_geo_deployment:secondary", "gitlab_geo_full_role:geo_secondary_site_sidekiq_secondary", "gitlab_geo_site:geo_secondary_site", "gitlab_node_level:sidekiq_secondary", "gitlab_node_prefix:gitlab", "gitlab_node_type:sidekiq", "iam_profile:gitlab-ec2-s3-access-profile", "image:REDACTED", "instance-type:m5.2xlarge", "kernel:none", "name:gitlab-tools-ue1-sidekiq-3", "region:us-east-1", "ec2-security-backup:true", "security-group:REDACTED", "terraform:true", "terraformmodule:gitlab-compute-resources:1.0.2", "terragruntmodule:tools/us-east-1/gitlab-onprem/gitlab-sidekiq-ec2" ], "host": "gitlab-tools-ue1-sidekiq-3", "service": "sidekiq", "message": "Geo::DestroyWorker JID-7e96c53d57113e8071439d20: done: 0.386539 sec", "attributes": { "service_name": "sidekiq", "pid": 161340, "dc": { "correlation_id": "c3609ea7af439c1ead4ed4bf9a45fa40" }, "db_main_wal_cached_count": 0, "redis_duration_s": 0.003785, "db_write_count": 3, "version": 0, "redis_queues_read_bytes": 10, "db_main_duration_s": 0.002, "mem_bytes": 4366512, "db_main_replica_wal_cached_count": 0, "enqueued_at": "2022-08-18T20:03:01.861Z", "redis_shared_state_read_bytes": 2, "db_primary_wal_count": 0, "job_size_bytes": 21, "correlation_id": "c3609ea7af439c1ead4ed4bf9a45fa40", "meta": { "client_id": "ip/", "feature_category": "geo_replication", "caller_id": "Geo::Secondary::RegistryConsistencyWorker" }, "redis_read_bytes": 418, "redis_cache_duration_s": 0.001441, "mem_objects": 35863, "idempotency_key": "resque:gitlab:duplicate:geo:geo_destroy:c526d80de74aded0756b9e716e855aeaaf47de5aa70025c2a4b5b8a199b8c919", "db_main_wal_count": 0, "db_primary_cached_count": 0, "job_status": "done", "queue": "geo:geo_destroy", "db_replica_wal_count": 0, "retry": 3, "redis_calls": 5, "severity": "INFO", "args": [ "lfs_object", "108934" ], "worker_data_consistency": "always", "db_replica_duration_s": 0, "redis_queues_duration_s": 0.000404, "db_primary_duration_s": 0.007, "completed_at": "2022-08-18T20:03:02.249Z", "class": "Geo::DestroyWorker", "redis_queues_calls": 1, "db_main_replica_wal_count": 0, "db_primary_count": 5, "db_main_replica_duration_s": 0, "redis_shared_state_write_bytes": 332, "db_duration_s": 0.006393, "scheduling_latency_s": 0.002136, "db_main_cached_count": 0, "db_main_count": 1, "jid": "7e96c53d57113e8071439d20", "external_http_duration_s": 0.30817187190405093, "redis_shared_state_calls": 2, "redis_cache_write_bytes": 129, "db_main_replica_count": 0, "db_replica_count": 0, "db_replica_wal_cached_count": 0, "cpu_s": 0.050792, "load_balancing_strategy": "primary", "size_limiter": "validated", "mem_mallocs": 16594, "external_http_count": 9, "db_cached_count": 0, "redis_shared_state_duration_s": 0.00194, "db_replica_cached_count": 0, "mem_total_bytes": 5801032, "queue_namespace": "geo", "redis_queues_write_bytes": 328, "duration_s": 0.386539, "redis_cache_calls": 2, "time": "2022-08-18T20:03:02.249Z", "redis_write_bytes": 789, "db_count": 5, "redis_cache_read_bytes": 406, "db_main_replica_cached_count": 0, "created_at": "2022-08-18T20:03:01.837Z", "db_primary_wal_cached_count": 0 } } }

    This seems to be caused by a bug in how LFS objects are scoped for the current secondary site in https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/models/ee/lfs_object.rb#L43-47 when object storage is enabled and GitLab-managed object replication is disabled. With object storage sync disabled, the relation returns only LFS objects whose file is stored locally, so it does not include the new LFS object that lives on object storage. As a result, Geo::RegistryConsistencyService triggers the deletion.
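
    For context, a simplified sketch of that scope (paraphrased, not the exact source; see the linked lfs_object.rb lines for the real code, and treat the scope and helper names below as assumptions based on this issue):

      # Paraphrased sketch of ee/app/models/ee/lfs_object.rb#L43-47; only the
      # shape of the logic matters here, the exact names may differ.
      class LfsObject < ApplicationRecord
        def self.replicables_for_current_secondary(ids)
          scope = primary_key_in(ids)

          # With GitLab-managed object replication disabled on the secondary site,
          # only locally stored files (file_store = 1) count as replicable, so an
          # LFS object sitting on object storage (file_store = 2) falls outside
          # this relation and its registry row looks orphaned to the registry
          # consistency logic.
          scope = scope.with_files_stored_locally unless ::Gitlab::Geo.current_node&.sync_object_storage

          scope
        end
      end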

    Workaround

    Use GitLab-managed object replication.
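
    The setting corresponds to the object storage replication checkbox in the Geo site settings on the primary. As a sketch, assuming that checkbox is backed by the sync_object_storage attribute on GeoNode, it can also be checked and changed from a Rails console on the primary site:

      # Sketch only: the site name is hypothetical, and sync_object_storage is
      # assumed to be the attribute behind the object storage replication checkbox.
      node = GeoNode.find_by(name: 'geo_secondary_site')
      node.sync_object_storage                 # => false while relying on AWS bucket replication
      node.update!(sync_object_storage: true)  # switch to GitLab-managed object replication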

    Root cause

    From !95937 (comment 1073517003):

    The customer is experiencing the bug described in #371397 (closed) while importing Bitbucket projects. This happens because during the LFS objects import phase we:

    1. Download the file into a temporary folder
    2. Create an entry in the lfs_objects table with file_store = 1
    3. Move the file to the object store
    4. Update the file_store column

    Since LfsObject.replicables_for_current_secondary returns only LFS objects where the file is stored locally (file_store = 1) when GitLab-managed replication is disabled, this relation will include the LFS object created in step 2 above, and Geo::RegistryConsistencyWorker will create a registry in the tracking database. Once the file is moved to object storage in steps 3-4, the object drops out of that relation, so the next time Geo::RegistryConsistencyWorker runs, it removes the registry rows improperly created by the workflow described above. We can validate this locally by following steps 1-11 in !95937 (merged), then:

    1. In a Rails console session on your primary site, create an LFS object:

      > LfsObject.safe_find_or_create_by!(oid: 'f2b0a1e7550e9b718dafc9b525a04879a766de62e4fbdfc46593d47f7ab74642', size: 10.kilobytes)
      => #<LfsObject:0x00000001186819b0
      id: 6,
      oid: "f2b0a1e7550e9b718dafc9b525a04879a766de62e4fbdfc46593d47f7ab74642",
      size: 51200,
      created_at: Mon, 22 Aug 2022 23:32:13.297947000 UTC +00:00,
      updated_at: Mon, 22 Aug 2022 23:32:13.297947000 UTC +00:00,
      file: "a1e7550e9b718dafc9b525a04879a766de62e4fbdfc46593d47f7ab74642",
      file_store: 1,
      verification_checksum: nil>
    2. In a Rails console session on your secondary site, check the latest Geo::LfsObjectRegistry record:

     > Geo::LfsObjectRegistry.last
     => #<Geo::LfsObjectRegistry:0x0000000130853000
      id: 6,
      created_at: Mon, 22 Aug 2022 23:32:16.167890000 UTC +00:00,
      retry_at: Mon, 22 Aug 2022 23:33:01.291105000 UTC +00:00,
      bytes: nil,
      lfs_object_id: 6,
      retry_count: 1,
      missing_on_primary: false,
      success: false,
      sha256: nil,
      state: 3,
      last_synced_at: Mon, 22 Aug 2022 23:32:16.166723000 UTC +00:00,
      last_sync_failure: "The file is missing on the Geo primary site",
      verification_started_at: nil,
      verified_at: nil,
      verification_retry_at: nil,
      verification_retry_count: 0,
      verification_state: 0,
      checksum_mismatch: false,
      verification_checksum: nil,
      verification_checksum_mismatched: nil,
      verification_failure: nil>
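
    To see why such a registry row is later removed, one can check which LFS objects the secondary still considers replicable once a file has moved to object storage; a sketch, assuming replicables_for_current_secondary accepts an id range as in the scope sketched earlier:

      # Sketch: run on the secondary site. An LFS object whose file_store has been
      # updated to object storage drops out of this relation when
      # sync_object_storage is disabled, so its registry row (and, per this issue,
      # the file in the secondary bucket) is cleaned up on the next
      # Geo::Secondary::RegistryConsistencyWorker run.
      LfsObject.replicables_for_current_secondary(1..LfsObject.maximum(:id)).pluck(:id, :file_store)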

    That's why this MR also introduces the same check in GitLab::Geo::Replication, to avoid downloading unnecessary LFS objects.

    If someone wants to go deeper into the current implementation, follow the method calls involved in the workflow above (a condensed sketch follows the list):

    1. Projects::ImportService#execute
    2. Projects::ImportService#download_lfs_objects
    3. Projects::LfsPointers::LfsImportService.new(project).execute
    4. Projects::LfsDownloadService.new(project, lfs_download_object).execute
    5. Projects::LfsDownloadService.find_or_create_lfs_object
    6. LfsObject.safe_find_or_create_by!(oid: lfs_oid, size: lfs_size)
    7. lfs_obj.update!(file: tmp_file) unless lfs_obj.file.file
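
    Condensed, the tail of that chain does roughly the following (paraphrased for illustration; download_and_store and download_to_tmp_dir are hypothetical names, not the actual service methods):

      # Rough sketch of the import sequence described in the Root cause section.
      def download_and_store(lfs_oid, lfs_size, download_url)
        tmp_file = download_to_tmp_dir(download_url)   # 1. download into a temporary folder

        # 2. the row is created with file_store = 1 (local) before any file is attached
        lfs_obj = LfsObject.safe_find_or_create_by!(oid: lfs_oid, size: lfs_size)

        # 3-4. attaching the file moves it to the object store and updates file_store,
        # at which point the object drops back out of replicables_for_current_secondary
        lfs_obj.update!(file: tmp_file) unless lfs_obj.file.file
      end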

    The bug was introduced in !80007 (merged) and only affects the following scenarios on GitLab instances running version >= 15.0:

    1. LFS objects created while importing a project with object storage enabled and GitLab-managed replication disabled, due to how we import those files.
    2. A customer using GitLab-managed replication to sync object storage who then disables it. The next time Geo::RegistryConsistencyWorker runs, it removes the registry rows and the files on object storage.
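
    For the second scenario, a rough way to gauge exposure from a Rails console before disabling managed replication (a sketch; ObjectStorage::Store::REMOTE is the file_store value for object storage, and sync_object_storage is assumed to be the GeoNode flag mentioned in the workaround above):

      # Sketch: LFS objects stored remotely fall outside the secondary's replicable
      # scope once sync_object_storage is turned off, so their registries and files
      # would be removed on the next consistency run.
      Gitlab::Geo.current_node.sync_object_storage
      LfsObject.where(file_store: ObjectStorage::Store::REMOTE).count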
    Edited by Michael Kozono
