Object deduplication on Geo secondary not happening if the first sync fails
In #328295 (closed), we found that, if the first sync fails (or the primary repository is missing) for a repository on a Geo secondary, we will not attempt to run the object deduplication until at least gc
syncs happen (200 syncs by default). This can lead to a massive accumulation of disk data that could be deduplicated in case of large projects being forked.
Copying the content of the comment in this issue:
Added way more verbose logging while testing:
Expand for changes
diff --git a/app/services/projects/git_deduplication_service.rb b/app/services/projects/git_deduplication_service.rb
index 74d469ecf37..52e7abff8c6 100644
--- a/app/services/projects/git_deduplication_service.rb
+++ b/app/services/projects/git_deduplication_service.rb
@@ -14,6 +14,7 @@ def initialize(project)
end
def execute
+ Gitlab::Geo::Logger.info("called dedupserv")
try_obtain_lease do
unless project.has_pool_repository?
disconnect_git_alternates
diff --git a/app/workers/repository_fork_worker.rb b/app/workers/repository_fork_worker.rb
index 5ec9ceaf004..5fdefb5d52e 100644
--- a/app/workers/repository_fork_worker.rb
+++ b/app/workers/repository_fork_worker.rb
@@ -12,6 +12,7 @@ class RepositoryForkWorker # rubocop:disable Scalability/IdempotentWorker
feature_category :source_code_management
def perform(*args)
+ sleep(180)
target_project_id = args.shift
target_project = Project.find(target_project_id)
diff --git a/ee/app/services/geo/project_housekeeping_service.rb b/ee/app/services/geo/project_housekeeping_service.rb
index a4221332121..5c7cb65a5cd 100644
--- a/ee/app/services/geo/project_housekeeping_service.rb
+++ b/ee/app/services/geo/project_housekeeping_service.rb
@@ -21,6 +21,7 @@ def initialize(project, new_repository: false)
def execute
increment!
+ Gitlab::Geo::Logger.info("new_repo? #{@new_repository}; needed? #{needed?};")
do_housekeeping if needed?
end
@@ -52,6 +53,7 @@ def do_housekeeping
end
def execute_gitlab_shell_gc(lease_uuid)
+ Gitlab::Geo::Logger.info("performing gitgc with #{project.id}, task #{task}")
::Projects::GitGarbageCollectWorker.perform_async(project.id, task, lease_key, lease_uuid)
ensure
if should_reset?
@@ -127,10 +129,12 @@ def create_object_pool_on_secondary
end
def create_object_pool_on_secondary?
+ Gitlab::Geo::Logger.info("called create_object_pool_on_secondary?")
return unless ::Gitlab::Geo.secondary?
return unless project.object_pool_missing?
return unless pool_repository.source_project_repository.exists?
+ Gitlab::Geo::Logger.info("creating object_pool_on_secondary")
true
end
end
diff --git a/ee/app/services/geo/repository_sync_service.rb b/ee/app/services/geo/repository_sync_service.rb
index 36e5023230f..f7940bbd245 100644
--- a/ee/app/services/geo/repository_sync_service.rb
+++ b/ee/app/services/geo/repository_sync_service.rb
@@ -49,6 +49,7 @@ def ensure_repository
end
def execute_housekeeping
+ log_info("Executing housekeeping for project #{project.id}, new repo? #{new_repository?}")
Geo::ProjectHousekeepingService.new(project, new_repository: new_repository?).execute
end
end
So far, the flow looks like:
Click to expand flow
{
"severity": "INFO",
"time": "2021-10-18T11:03:56.044Z",
"correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
"class": "Geo::RepositorySyncService",
"host": "gdk.test",
"message": "Trying to fetch repository",
"project_id": 32,
"project_path": "root/html-5-boilerplate-4",
"storage_version": 2
}
{
"severity": "INFO",
"time": "2021-10-18T11:03:56.721Z",
"correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
"class": "Geo::RepositorySyncService",
"host": "gdk.test",
"message": "Repository is not found, marking it as successfully synced",
"project_id": 32,
"project_path": "root/html-5-boilerplate-4",
"storage_version": 2
}
{
"severity": "INFO",
"time": "2021-10-18T11:03:56.721Z",
"correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
"class": "Geo::RepositorySyncService",
"host": "gdk.test",
"message": "Marking repository sync as successful",
"project_id": 32,
"project_path": "root/html-5-boilerplate-4",
"storage_version": 2
}
{
"severity": "INFO",
"time": "2021-10-18T11:03:56.763Z",
"correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
"class": "Geo::RepositorySyncService",
"host": "gdk.test",
"message": "Finished repository sync",
"project_id": 32,
"project_path": "root/html-5-boilerplate-4",
"storage_version": 2,
"update_delay_s": 34.861,
"download_time_s": 0.717
}
{
"severity": "INFO",
"time": "2021-10-18T11:03:56.764Z",
"correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
"class": "Geo::RepositorySyncService",
"host": "gdk.test",
"message": "Expiring caches",
"project_id": 32,
"project_path": "root/html-5-boilerplate-4",
"storage_version": 2
}
{
"severity": "INFO",
"time": "2021-10-18T11:03:56.869Z",
"correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
"class": "Geo::RepositorySyncService",
"host": "gdk.test",
"message": "Executing housekeeping",
"project_id": 32,
"project_path": "root/html-5-boilerplate-4",
"storage_version": 2
}
{
"severity": "INFO",
"time": "2021-10-18T11:03:56.903Z",
"correlation_id": "674dcfc27fcafa21ebb4975c311efc22",
"message": "new_repo? false; needed? false;"
}
Then, the next time it syncs, it's not a new_repository anymore, so it won't create the pool repo until the next gc
period, when it calls housekeeping with the gc
task and realizes it did not create one.
I believe the problem is that we set the new_repository
flag only after the we attempt the fetch_geo_mirror
, which means if this fails, new_repository
will be set to false. The fix might be as simple as moving the @new_repository = true
assignment upwards before we know it might fail, needs double checking what happens when attempting to link a non-existing repo to a pool.