Geo file downloads can block Sidekiq threads
Summary
FileDownloadService calls that do not finish can leave Sidekiq workers hanging, eventually blocking all threads.
We use ::HTTP (the httprb gem) in Gitlab::Geo::Replication::BaseTransfer and do not set a timeout. Unlike Net::HTTP, httprb does not seem to set a default read timeout (the Net::HTTP default is 60s).
The httprb gem uses readpartial, which can block if there's no data available:
readpartial is designed for streams such as pipe, socket, tty, etc. It blocks only when no data immediately available. This means that it blocks only when following all conditions hold.
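The blocking behaviour can be illustrated with the standard library alone. This is a minimal sketch, assuming a pipe stands in for the stalled socket. `read_with_timeout` is a hypothetical helper, not part of httprb or GitLab. It waits on IO.select for a bounded time before calling readpartial, which is roughly what a per-read timeout does.

```ruby
# Hypothetical helper: wait up to `seconds` for data before calling readpartial.
def read_with_timeout(io, maxlen, seconds)
  # IO.select returns nil when nothing becomes readable within `seconds`.
  ready = IO.select([io], nil, nil, seconds)
  return :timeout if ready.nil?

  # Data is available, so readpartial returns immediately.
  io.readpartial(maxlen)
end

reader, writer = IO.pipe

# Nothing has been written yet: a bare reader.readpartial(1024) would block here.
first = read_with_timeout(reader, 1024, 0.1)

# Once data arrives, the same call returns it.
writer.write("chunk")
second = read_with_timeout(reader, 1024, 0.1)

puts first.inspect
puts second.inspect
```

Without the IO.select guard, the first read would wait for as long as the writer stays silent, which matches the stuck thread in the backtrace.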
Steps to reproduce
I don't yet have a reliable way to reproduce this.
What is the current bug behavior?
FileDownload requests to the Geo primary can get stuck, leaving multiple Sidekiq workers hanging.
What is the expected correct behavior?
We should set a read timeout so that a download does not block indefinitely when no data is available or something goes wrong.
Relevant logs and/or screenshots
Most relevant backtrace from a Sidekiq thread dump with kill -TTIN:
/opt/gitlab/embedded/lib/ruby/2.6.0/openssl/buffering.rb:125:in `sysread'
/opt/gitlab/embedded/lib/ruby/2.6.0/openssl/buffering.rb:125:in `readpartial'
/opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/http-4.2.0/lib/http/timeout/null.rb:45:in `readpartial'
/opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/http-4.2.0/lib/http/connection.rb:212:in `read_more'
/opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/http-4.2.0/lib/http/connection.rb:92:in `readpartial'
/opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/http-4.2.0/lib/http/response/body.rb:30:in `readpartial'
/opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/http-4.2.0/lib/http/response/body.rb:36:in `each'
/opt/gitlab/embedded/service/gitlab-rails/ee/lib/gitlab/geo/replication/base_transfer.rb:137:in `download_file'
/opt/gitlab/embedded/service/gitlab-rails/ee/lib/gitlab/geo/replication/base_transfer.rb:64:in `download_from_primary'
/opt/gitlab/embedded/service/gitlab-rails/ee/lib/gitlab/geo/replication/job_artifact_downloader.rb:19:in `execute'
/opt/gitlab/embedded/service/gitlab-rails/ee/app/services/geo/file_download_service.rb:20:in `block in execute'
/opt/gitlab/embedded/service/gitlab-rails/app/services/concerns/exclusive_lease_guard.rb:29:in `try_obtain_lease'
/opt/gitlab/embedded/service/gitlab-rails/ee/app/services/geo/file_download_service.rb:17:in `execute'
/opt/gitlab/embedded/service/gitlab-rails/ee/app/workers/geo/file_download_worker.rb:11:in `perform'
Also a symptom, but not the main cause, is the FileDownloadDispatchWorker looping with no further enqueues:
"message":"Loop 3496","enqueued":0,"pending":980,"scheduled":10,"capacity":10
"message":"Loop 3497","enqueued":0,"pending":980,"scheduled":10,"capacity":10
"message":"Loop 3498","enqueued":0,"pending":980,"scheduled":10,"capacity":10
"message":"Loop 3499","enqueued":0,"pending":980,"scheduled":10,"capacity":10
"message":"Loop 3500","enqueued":0,"pending":980,"scheduled":10,"capacity":10
"message":"Loop 3501","enqueued":0,"pending":980,"scheduled":10,"capacity":10
Possible fixes
httprb seems to support per-operation timeouts, so that may be worth investigating here.