Object deduplication on Geo secondary not happening if the first sync fails

In #328295 (closed), we found that, if the first sync fails (or the primary repository is missing) for a repository on a Geo secondary, we will not attempt to run the object deduplication until at least gc syncs happen (200 syncs by default). This can lead to a massive accumulation of disk data that could be deduplicated in case of large projects being forked.

Copying the content of the comment in this issue:

Added way more verbose logging while testing:

Expand for changes
diff --git a/app/services/projects/git_deduplication_service.rb b/app/services/projects/git_deduplication_service.rb
index 74d469ecf37..52e7abff8c6 100644
--- a/app/services/projects/git_deduplication_service.rb
+++ b/app/services/projects/git_deduplication_service.rb
@@ -14,6 +14,7 @@ def initialize(project)
     end
 
     def execute
+      Gitlab::Geo::Logger.info("called dedupserv")
       try_obtain_lease do
         unless project.has_pool_repository?
           disconnect_git_alternates
diff --git a/app/workers/repository_fork_worker.rb b/app/workers/repository_fork_worker.rb
index 5ec9ceaf004..5fdefb5d52e 100644
--- a/app/workers/repository_fork_worker.rb
+++ b/app/workers/repository_fork_worker.rb
@@ -12,6 +12,7 @@ class RepositoryForkWorker # rubocop:disable Scalability/IdempotentWorker
   feature_category :source_code_management
 
   def perform(*args)
+    sleep(180)
     target_project_id = args.shift
     target_project = Project.find(target_project_id)
 
diff --git a/ee/app/services/geo/project_housekeeping_service.rb b/ee/app/services/geo/project_housekeeping_service.rb
index a4221332121..5c7cb65a5cd 100644
--- a/ee/app/services/geo/project_housekeeping_service.rb
+++ b/ee/app/services/geo/project_housekeeping_service.rb
@@ -21,6 +21,7 @@ def initialize(project, new_repository: false)
 
     def execute
       increment!
+      Gitlab::Geo::Logger.info("new_repo? #{@new_repository}; needed? #{needed?};")
       do_housekeeping if needed?
     end
 
@@ -52,6 +53,7 @@ def do_housekeeping
     end
 
     def execute_gitlab_shell_gc(lease_uuid)
+      Gitlab::Geo::Logger.info("performing gitgc with #{project.id}, task #{task}")
       ::Projects::GitGarbageCollectWorker.perform_async(project.id, task, lease_key, lease_uuid)
     ensure
       if should_reset?
@@ -127,10 +129,12 @@ def create_object_pool_on_secondary
     end
 
     def create_object_pool_on_secondary?
+      Gitlab::Geo::Logger.info("called create_object_pool_on_secondary?")
       return unless ::Gitlab::Geo.secondary?
       return unless project.object_pool_missing?
       return unless pool_repository.source_project_repository.exists?
 
+      Gitlab::Geo::Logger.info("creating object_pool_on_secondary")
       true
     end
   end
diff --git a/ee/app/services/geo/repository_sync_service.rb b/ee/app/services/geo/repository_sync_service.rb
index 36e5023230f..f7940bbd245 100644
--- a/ee/app/services/geo/repository_sync_service.rb
+++ b/ee/app/services/geo/repository_sync_service.rb
@@ -49,6 +49,7 @@ def ensure_repository
     end
 
     def execute_housekeeping
+      log_info("Executing housekeeping for project #{project.id}, new repo? #{new_repository?}")
       Geo::ProjectHousekeepingService.new(project, new_repository: new_repository?).execute
     end
   end

So far, the flow looks like:

Click to expand flow
{
  "severity": "INFO",
  "time": "2021-10-18T11:03:56.044Z",
  "correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
  "class": "Geo::RepositorySyncService",
  "host": "gdk.test",
  "message": "Trying to fetch repository",
  "project_id": 32,
  "project_path": "root/html-5-boilerplate-4",
  "storage_version": 2
}
{
  "severity": "INFO",
  "time": "2021-10-18T11:03:56.721Z",
  "correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
  "class": "Geo::RepositorySyncService",
  "host": "gdk.test",
  "message": "Repository is not found, marking it as successfully synced",
  "project_id": 32,
  "project_path": "root/html-5-boilerplate-4",
  "storage_version": 2
}
{
  "severity": "INFO",
  "time": "2021-10-18T11:03:56.721Z",
  "correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
  "class": "Geo::RepositorySyncService",
  "host": "gdk.test",
  "message": "Marking repository sync as successful",
  "project_id": 32,
  "project_path": "root/html-5-boilerplate-4",
  "storage_version": 2
}
{
  "severity": "INFO",
  "time": "2021-10-18T11:03:56.763Z",
  "correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
  "class": "Geo::RepositorySyncService",
  "host": "gdk.test",
  "message": "Finished repository sync",
  "project_id": 32,
  "project_path": "root/html-5-boilerplate-4",
  "storage_version": 2,
  "update_delay_s": 34.861,
  "download_time_s": 0.717
}
{
  "severity": "INFO",
  "time": "2021-10-18T11:03:56.764Z",
  "correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
  "class": "Geo::RepositorySyncService",
  "host": "gdk.test",
  "message": "Expiring caches",
  "project_id": 32,
  "project_path": "root/html-5-boilerplate-4",
  "storage_version": 2
}
{
  "severity": "INFO",
  "time": "2021-10-18T11:03:56.869Z",
  "correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
  "class": "Geo::RepositorySyncService",
  "host": "gdk.test",
  "message": "Executing housekeeping",
  "project_id": 32,
  "project_path": "root/html-5-boilerplate-4",
  "storage_version": 2
}
{
  "severity": "INFO",
  "time": "2021-10-18T11:03:56.903Z",
  "correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
  "message": "new_repo? false; needed? false;"
}

Then, the next time it syncs, it's not a new_repository anymore, so it won't create the pool repo until the next gc period, when it calls housekeeping with the gc task and realizes it did not create one.

I believe the problem is that we set the new_repository flag only after the we attempt the fetch_geo_mirror, which means if this fails, new_repository will be set to false. The fix might be as simple as moving the @new_repository = true assignment upwards before we know it might fail, needs double checking what happens when attempting to link a non-existing repo to a pool.

Edited by Catalin Irimie