Object deduplication on Geo secondary not happening if the first sync fails
In #328295, we found that, if the first sync fails (or the primary repository is missing) for a repository on a Geo secondary, we will not attempt to run the object deduplication until at least `gc` syncs happen (200 syncs by default). This can lead to a massive accumulation of disk data that could be deduplicated in case of large projects being forked. Copying the content of the comment in this issue: Added way more verbose logging while testing: <details> <summary>Expand for changes</summary> ```patch diff --git a/app/services/projects/git_deduplication_service.rb b/app/services/projects/git_deduplication_service.rb index 74d469ecf37..52e7abff8c6 100644 --- a/app/services/projects/git_deduplication_service.rb +++ b/app/services/projects/git_deduplication_service.rb @@ -14,6 +14,7 @@ def initialize(project) end def execute + Gitlab::Geo::Logger.info("called dedupserv") try_obtain_lease do unless project.has_pool_repository? disconnect_git_alternates diff --git a/app/workers/repository_fork_worker.rb b/app/workers/repository_fork_worker.rb index 5ec9ceaf004..5fdefb5d52e 100644 --- a/app/workers/repository_fork_worker.rb +++ b/app/workers/repository_fork_worker.rb @@ -12,6 +12,7 @@ class RepositoryForkWorker # rubocop:disable Scalability/IdempotentWorker feature_category :source_code_management def perform(*args) + sleep(180) target_project_id = args.shift target_project = Project.find(target_project_id) diff --git a/ee/app/services/geo/project_housekeeping_service.rb b/ee/app/services/geo/project_housekeeping_service.rb index a4221332121..5c7cb65a5cd 100644 --- a/ee/app/services/geo/project_housekeeping_service.rb +++ b/ee/app/services/geo/project_housekeeping_service.rb @@ -21,6 +21,7 @@ def initialize(project, new_repository: false) def execute increment! + Gitlab::Geo::Logger.info("new_repo? #{@new_repository}; needed? #{needed?};") do_housekeeping if needed? end @@ -52,6 +53,7 @@ def do_housekeeping end def execute_gitlab_shell_gc(lease_uuid) + Gitlab::Geo::Logger.info("performing gitgc with #{project.id}, task #{task}") ::Projects::GitGarbageCollectWorker.perform_async(project.id, task, lease_key, lease_uuid) ensure if should_reset? @@ -127,10 +129,12 @@ def create_object_pool_on_secondary end def create_object_pool_on_secondary? + Gitlab::Geo::Logger.info("called create_object_pool_on_secondary?") return unless ::Gitlab::Geo.secondary? return unless project.object_pool_missing? return unless pool_repository.source_project_repository.exists? + Gitlab::Geo::Logger.info("creating object_pool_on_secondary") true end end diff --git a/ee/app/services/geo/repository_sync_service.rb b/ee/app/services/geo/repository_sync_service.rb index 36e5023230f..f7940bbd245 100644 --- a/ee/app/services/geo/repository_sync_service.rb +++ b/ee/app/services/geo/repository_sync_service.rb @@ -49,6 +49,7 @@ def ensure_repository end def execute_housekeeping + log_info("Executing housekeeping for project #{project.id}, new repo? #{new_repository?}") Geo::ProjectHousekeepingService.new(project, new_repository: new_repository?).execute end end ``` </details> So far, the flow looks like: <details> <summary>Click to expand flow</summary> ```json { "severity": "INFO", "time": "2021-10-18T11:03:56.044Z", "correlation_id": "674dcfc27fcafa21ebb4975c311efc22", "class": "Geo::RepositorySyncService", "host": "gdk.test", "message": "Trying to fetch repository", "project_id": 32, "project_path": "root/html-5-boilerplate-4", "storage_version": 2 } { "severity": "INFO", "time": "2021-10-18T11:03:56.721Z", "correlation_id": "674dcfc27fcafa21ebb4975c311efc22", "class": "Geo::RepositorySyncService", "host": "gdk.test", "message": "Repository is not found, marking it as successfully synced", "project_id": 32, "project_path": "root/html-5-boilerplate-4", "storage_version": 2 } { "severity": "INFO", "time": "2021-10-18T11:03:56.721Z", "correlation_id": "674dcfc27fcafa21ebb4975c311efc22", "class": "Geo::RepositorySyncService", "host": "gdk.test", "message": "Marking repository sync as successful", "project_id": 32, "project_path": "root/html-5-boilerplate-4", "storage_version": 2 } { "severity": "INFO", "time": "2021-10-18T11:03:56.763Z", "correlation_id": "674dcfc27fcafa21ebb4975c311efc22", "class": "Geo::RepositorySyncService", "host": "gdk.test", "message": "Finished repository sync", "project_id": 32, "project_path": "root/html-5-boilerplate-4", "storage_version": 2, "update_delay_s": 34.861, "download_time_s": 0.717 } { "severity": "INFO", "time": "2021-10-18T11:03:56.764Z", "correlation_id": "674dcfc27fcafa21ebb4975c311efc22", "class": "Geo::RepositorySyncService", "host": "gdk.test", "message": "Expiring caches", "project_id": 32, "project_path": "root/html-5-boilerplate-4", "storage_version": 2 } { "severity": "INFO", "time": "2021-10-18T11:03:56.869Z", "correlation_id": "674dcfc27fcafa21ebb4975c311efc22", "class": "Geo::RepositorySyncService", "host": "gdk.test", "message": "Executing housekeeping", "project_id": 32, "project_path": "root/html-5-boilerplate-4", "storage_version": 2 } { "severity": "INFO", "time": "2021-10-18T11:03:56.903Z", "correlation_id": "674dcfc27fcafa21ebb4975c311efc22", "message": "new_repo? false; needed? false;" } ``` </details> Then, the next time it syncs, it's not a new_repository anymore, so it won't create the pool repo until the next `gc` period, when it calls housekeeping with the `gc` task and realizes it did not create one. I believe the problem is that we [set the `new_repository` flag](https://gitlab.com/gitlab-org/gitlab/blob/6e9dfb6aadf6a5c114fe8f018016e0410b7b8a05/ee/app/services/geo/repository_base_sync_service.rb#L64) only after the we attempt the `fetch_geo_mirror`, which means if this fails, `new_repository` will be set to false. The fix might be as simple as moving the `@new_repository = true` assignment upwards before we know it _might_ fail, needs double checking what happens when attempting to link a non-existing repo to a pool.
issue