Geo: Redownloading repositories via snapshot does not work
Summary
Redownloading repositories via snapshot may not work as intended. The flow looks like:
-
Download a snapshot of the repo to a temp repo
- Gitaly downloads a snapshot of the repo to yet another temp repo, then moves that to the first temp repo (by the way, I tried calling
create_from_snapshot
on the canonical repo to avoid an extra temp repo step, but Gitaly'sCreateRepositoryFromSnapshot
errors withdestination directory exists
)
- Gitaly downloads a snapshot of the repo to yet another temp repo, then moves that to the first temp repo (by the way, I tried calling
- If there was no error during
fetch_snapshot
, then we exitredownload_repository
and ensure toclean_up_temporary_repository
, which deletes the temp repo
It looks like the canonical repository does not get touched?
Steps to reproduce
- In Rails console on the secondary
r = Geo::ProjectRegistry.last
r.flag_repository_for_redownload!
-
cd
to wheregeo.log
is tail -f geo.log
- In the primary UI, add a file to that project repo to 1) trigger a sync and 2) if the logs say it succeeded, we can check the secondary repo to see if in fact it is up-to-date.
{"severity":"INFO","time":"2020-03-10T21:14:46.104Z","correlation_id":"7dd185e39827e70e3efb4461dcd7662a","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Started repository sync","job_id":"c1930579cb5fb95a9ca492fc","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:14:46.105Z","correlation_id":"7dd185e39827e70e3efb4461dcd7662a","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Marking repository sync as started","job_id":"c1930579cb5fb95a9ca492fc","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:14:46.112Z","correlation_id":"7dd185e39827e70e3efb4461dcd7662a","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Trying to fetch repository","job_id":"c1930579cb5fb95a9ca492fc","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:14:46.115Z","correlation_id":"7dd185e39827e70e3efb4461dcd7662a","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Redownloading repository","job_id":"c1930579cb5fb95a9ca492fc","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:14:46.115Z","correlation_id":"7dd185e39827e70e3efb4461dcd7662a","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Attempting to fetch repository via snapshot","job_id":"c1930579cb5fb95a9ca492fc","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
Locally, I added log output to `clean_up_temporary_repository` which we see in the next line. I had also added logging to `set_temp_repository_as_main` which we do not see here.
{"severity":"INFO","time":"2020-03-10T21:14:46.341Z","correlation_id":"7dd185e39827e70e3efb4461dcd7662a","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Temporary repository found, cleaning up...","job_id":"c1930579cb5fb95a9ca492fc","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:14:46.666Z","correlation_id":"7dd185e39827e70e3efb4461dcd7662a","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Marking repository sync as successful","job_id":"c1930579cb5fb95a9ca492fc","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:14:46.674Z","correlation_id":"7dd185e39827e70e3efb4461dcd7662a","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Finished repository sync","job_id":"c1930579cb5fb95a9ca492fc","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2,"update_delay_s":185.118,"download_time_s":0.561}
{"severity":"INFO","time":"2020-03-10T21:14:46.674Z","correlation_id":"7dd185e39827e70e3efb4461dcd7662a","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Expiring caches","job_id":"c1930579cb5fb95a9ca492fc","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:14:46.732Z","correlation_id":"7dd185e39827e70e3efb4461dcd7662a","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Finished repository sync","job_id":"c1930579cb5fb95a9ca492fc","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
And the repo is still out-of-date.
But since there was no error, it sets force_to_redownload_repository
to false
. The next sync then uses the normal git fetch
flow and which successfully updates the repo.
{"severity":"INFO","time":"2020-03-10T21:15:13.778Z","correlation_id":"5fab1a22244dd003eb08c6398480398d","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Started repository sync","job_id":"56434c623ad59559282a13ed","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:15:13.778Z","correlation_id":"5fab1a22244dd003eb08c6398480398d","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Marking repository sync as started","job_id":"56434c623ad59559282a13ed","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:15:13.785Z","correlation_id":"5fab1a22244dd003eb08c6398480398d","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Trying to fetch repository","job_id":"56434c623ad59559282a13ed","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:15:14.703Z","correlation_id":"5fab1a22244dd003eb08c6398480398d","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Marking repository sync as successful","job_id":"56434c623ad59559282a13ed","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:15:14.709Z","correlation_id":"5fab1a22244dd003eb08c6398480398d","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Finished repository sync","job_id":"56434c623ad59559282a13ed","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2,"update_delay_s":213.154,"download_time_s":0.924}
{"severity":"INFO","time":"2020-03-10T21:15:14.710Z","correlation_id":"5fab1a22244dd003eb08c6398480398d","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Expiring caches","job_id":"56434c623ad59559282a13ed","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:15:14.873Z","correlation_id":"5fab1a22244dd003eb08c6398480398d","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Finished repository sync","job_id":"56434c623ad59559282a13ed","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
In this case, we eventually end up with an up-to-date repo and a working sync.
Also note, fetch_snapshot
rescues StandardError
and we fallback to typical git fetch
if an error occurs in fetch_snapshot
. I believe this causes more cases to end up with a working sync. For example, if you remove the repository directory from the secondary, or if you mangle the repo in the secondary, Gitaly errors during the snapshot RPC, and we end up doing a fallback git fetch
.
Therefore, this usually isn't a persistent problem.
However, @cat had a customer who encountered repos that somehow could not leave the redownload
sync flow. In that case, when force_to_redownload_repository
was manually set to false
, the repos could successfully sync again.
What is the current bug behavior?
- Snapshot flow never truly succeeds.
- Furthermore, since redownload is the intended flow when retry_count is 5+, it is possible for repos to get stuck in the redownload flow and never sync.
What is the expected correct behavior?
- Snapshot flow should succeed.
Relevant logs and/or screenshots
(Paste any relevant logs - please use code blocks (```) to format console output, logs, and code as it's tough to read otherwise.)
Results of GitLab environment info
Tested locally on master branch. At least one customer was affected as well.
Possible fixes
The original implementation called set_temp_repository_as_main
after fetch_snapshot
, so the bug was introduced later.
I tried changing
return if fetch_snapshot
to
if fetch_snapshot
set_temp_repository_as_main
return
end
And set_temp_repository_as_main
successfully completed, but there is some other problem.
Gitaly errored with 13:exit status 128
and was rescued in rescue Gitlab::Shell::Error, Gitlab::Git::BaseError => e
https://gitlab.com/gitlab-org/gitlab/-/blob/v12.8.5-ee/ee/app/services/geo/repository_sync_service.rb#L14.
I added additional logging to set_temp_repository_as_main
:
{"severity":"INFO","time":"2020-03-10T21:03:54.226Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Started repository sync","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:03:54.226Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Marking repository sync as started","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:03:54.233Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Trying to fetch repository","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:03:54.237Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Redownloading repository","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:03:54.237Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Attempting to fetch repository via snapshot","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:03:54.426Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Setting newly downloaded repository as main","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2,"storage_shard":"default","temp_path":"@geo-temporary/@hashed/e3/d6/e3d6c4d4599e00882384ca981ee287ed961fa5f3828e2adb5e9ea890ab0d0525","deleted_disk_path_temp":"@failed-geo-sync/@hashed/e3/d6/e3d6c4d4599e00882384ca981ee287ed961fa5f3828e2adb5e9ea890ab0d0525","disk_path":"@hashed/e3/d6/e3d6c4d4599e00882384ca981ee287ed961fa5f3828e2adb5e9ea890ab0d0525"}
{"severity":"INFO","time":"2020-03-10T21:03:54.433Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Original repository exists, moving it out of the way...","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:03:54.436Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Moving temporary repository to the canonical path...","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:03:54.437Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Purging original repository...","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:03:54.442Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Successfully set_temp_repository_as_main","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"ERROR","time":"2020-03-10T21:03:54.479Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Error syncing repository","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2,"error":"13:exit status 128"}
{"severity":"INFO","time":"2020-03-10T21:03:54.495Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Expiring caches","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
{"severity":"INFO","time":"2020-03-10T21:03:54.669Z","correlation_id":"7fc7a94f4e400f48e8f9cadb7359d72c","class":"Geo::RepositorySyncService","host":"127.0.0.1","message":"Finished repository sync","job_id":"57d1b98cf67d3c731e942386","project_id":94,"project_path":"gitlab-qa-sandbox-group/qa-test-2020-02-06-15-31-27-35a7b9f7c4c4483f/geo-after-rename-14df8209e611ddfa","storage_version":2}
Alternate fix
Update: I initially thought perhaps we could remove the whole "redownload" flow, but some repos are probably getting synced by git fetch
into a bare repo inside the "redownload" flow, so we can only safely remove snapshot. Here's a WIP: !31935 (closed)
So I prefer to attempt to fix snapshot flow first, and only remove it as a last resort.
If it's true that the snapshot flow never succeeds, then it seems to be proven unnecessary. We could remove a decent amount of some complexity.
Here are the original issues:
Here is the initial MR:
Severity
For customers that have repos stuck in the redownload flow, it seems like ~S2, workaround too complex: Assuming you successfully diagnose the problem, you can set retry_count
to 1
and force_to_redownload_repository
to false
in console to get it to use the git fetch
flow.
Fortunately, this bug doesn't seem to impact most customers. More customers are seeing this, see the comments.