Geo: Bug: FileDownloadDispatchWorker may sometimes excessively resync files, and block new files from being synced until the next hour
Summary
In certain circumstances, FileDownloadDispatchWorker
goes into an almost infinite loop of syncing and resyncing files that are already synced.
Steps to reproduce
- Set up 2 fresh instances with
ansible-gitlab-geo
on GitLab Version 12.7.6-ee - Create a project
- Clone it
git lfs install
git lfs track "*.txt"
- Create 10 new .txt files (because File sync capacity defaults to 10)
create_n_txt_files() { for n in {1..$1}; do echo "Hello $n" > "File$n.txt" done } create_n_txt_files 10
git add . && git commit -m "Add 10 LFS files" && git push
This is reproducible locally on master (which is working on 12.9) as well. FileDownloadDispatchWorker
uses a different query if the feature geo_lfs_registry_ssot_sync
is enabled or disabled, but it does the infinite loop behavior either way.
What is the current bug behavior?
- SSH into the secondary and
tail -f /var/log/gitlab/gitlab-rails/geo.log | grep File
- Observe an unending loop of
FileDownloadDispatchWorker
enqueuingFileDownloadService
for IDs that are already synced.- Until
FileDownloadDispatchWorker
's 1 hour timeout is reached
- Until
What is the expected correct behavior?
- A single batch of 10
FileDownloadService
log lines with messageFile download
Relevant logs and/or screenshots
{"severity":"INFO","time":"2020-02-24T08:46:47.272Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadDispatchWorker","message":"Loop 17","job_id":"ffbc8e9ed85831d8c7f8f0ad","enqueued":10,"pending":0,"scheduled":10,"capacity":10}
{"severity":"INFO","time":"2020-02-24T08:46:47.291Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":105,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.087}
{"severity":"INFO","time":"2020-02-24T08:46:47.294Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":106,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.088}
{"severity":"INFO","time":"2020-02-24T08:46:47.310Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":111,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.106}
{"severity":"INFO","time":"2020-02-24T08:46:47.331Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":114,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.096}
{"severity":"INFO","time":"2020-02-24T08:46:47.338Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":108,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.109}
{"severity":"INFO","time":"2020-02-24T08:46:47.345Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":110,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.121}
{"severity":"INFO","time":"2020-02-24T08:46:47.346Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":109,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.091}
{"severity":"INFO","time":"2020-02-24T08:46:47.348Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":112,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.084}
{"severity":"INFO","time":"2020-02-24T08:46:47.377Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":107,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.108}
{"severity":"INFO","time":"2020-02-24T08:46:47.392Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":113,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.094}
{"severity":"INFO","time":"2020-02-24T08:46:48.361Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadDispatchWorker","message":"Loop 18","job_id":"ffbc8e9ed85831d8c7f8f0ad","enqueued":10,"pending":0,"scheduled":10,"capacity":10}
{"severity":"INFO","time":"2020-02-24T08:46:48.406Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":106,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.087}
{"severity":"INFO","time":"2020-02-24T08:46:48.421Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":105,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.106}
{"severity":"INFO","time":"2020-02-24T08:46:48.423Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":110,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.099}
{"severity":"INFO","time":"2020-02-24T08:46:48.426Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":108,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.103}
{"severity":"INFO","time":"2020-02-24T08:46:48.431Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":114,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.107}
{"severity":"INFO","time":"2020-02-24T08:46:48.438Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":107,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.061}
{"severity":"INFO","time":"2020-02-24T08:46:48.470Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":112,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.093}
{"severity":"INFO","time":"2020-02-24T08:46:48.474Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":113,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.098}
{"severity":"INFO","time":"2020-02-24T08:46:48.483Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":111,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.162}
{"severity":"INFO","time":"2020-02-24T08:46:48.485Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":109,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.113}
{"severity":"INFO","time":"2020-02-24T08:46:49.470Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadDispatchWorker","message":"Loop 19","job_id":"ffbc8e9ed85831d8c7f8f0ad","enqueued":10,"pending":0,"scheduled":10,"capacity":10}
{"severity":"INFO","time":"2020-02-24T08:46:49.475Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":106,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.067}
{"severity":"INFO","time":"2020-02-24T08:46:49.484Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":111,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.069}
{"severity":"INFO","time":"2020-02-24T08:46:49.518Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":105,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.1}
{"severity":"INFO","time":"2020-02-24T08:46:49.519Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":108,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.084}
{"severity":"INFO","time":"2020-02-24T08:46:49.521Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":110,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.081}
{"severity":"INFO","time":"2020-02-24T08:46:49.522Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":114,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.085}
{"severity":"INFO","time":"2020-02-24T08:46:49.543Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":107,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.069}
{"severity":"INFO","time":"2020-02-24T08:46:49.557Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":112,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.089}
{"severity":"INFO","time":"2020-02-24T08:46:49.562Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":113,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.075}
{"severity":"INFO","time":"2020-02-24T08:46:49.570Z","correlation_id":"2ad51c871d2bcd529cb6e3a03fb80672","class":"Geo::FileDownloadService","object_type":"lfs","object_db_id":109,"message":"File download","mark_as_synced":true,"download_success":true,"bytes_downloaded":10,"failed_before_transfer":false,"primary_missing_file":false,"download_time_s":0.098}
- I have found this get to Loop 3279 on one instance before it hit the 1 hour timeout.
- If you push below the batch size, it does not trigger the loop.
- If you push new files while it's in this loop, it does not sync those.
- If you push 100 new LFS files while
FileDownloadDispatchWorker
is between runs because there wasn't much LFS activity, when it starts up again, it actually cycles through all 100 of the new files before starting from the beginning of the 100 files again.- Extrapolating from this, I expect LFS syncing becomes semi-up-to-date every hour, and if you have very low LFS activity for a couple hours, overnight for example, then the secondaries have a chance to break out of this behavior until somebody pushes 10 or more files between runs. This is probably why this bug has not been reported. It is typically not permanent, and while it's happening it gets near synced every hour. For this same reason, I think we can bump this down to ~S3. But I'd do ~P2.
- If you restart sidekiq, it finishes that group of files that it was stuck on, but if there are 10 or more files already pending to be synced after that, then it will get stuck looping on those.
Results of GitLab environment info
A fresh Geo setup on GitLab Version 12.7.6-ee. This happens on newer versions too.
Possible fixes
Unconfirmed suspicion:
I think the LfsObjectRegistryFinder#find_unsynced
is returning the same results during each FileDownloadDispatchWorker#load_pending_resources
. The condition that is supposed to ensure these results are excluded is missing_registry
at https://gitlab.com/gitlab-org/gitlab/-/blob/v12.7.6-ee/ee/app/finders/geo/lfs_object_registry_finder.rb#L44. I think the scopes are correct. I think the database data is correct. So my suspicion is the problem is in the area of the ORM layer, or how we are using FDW, or how we are handling database connections, etc.
I have a feeling the root cause is the same as what blocked me in !20002 (comment 266285571).