Geo file downloads can block Sidekiq threads

Summary

FileDownloadService calls that never finish can leave Sidekiq workers hanging, eventually blocking all threads.

We use ::HTTP (the httprb gem) in Gitlab::Geo::Replication::BaseTransfer without setting a timeout. Unlike Net::HTTP, httprb does not seem to set a default read timeout per read operation (the Net::HTTP default is 60s).
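For comparison, the Net::HTTP default mentioned above can be checked from the standard library without opening a connection. This is only a quick sanity check; the host name is a placeholder and not a real Geo primary.

```ruby
require "net/http"

# Net::HTTP.new only builds the object; no connection is opened here.
# "primary.example.com" is a placeholder host used for illustration.
http = Net::HTTP.new("primary.example.com", 443)

default_read_timeout = http.read_timeout
puts default_read_timeout # 60 (seconds) unless overridden

# The timeout can be changed per instance before the request is made.
http.read_timeout = 30
```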

The httprb gem uses readpartial, which can block indefinitely if no data is available. From the Ruby documentation for IO#readpartial:

readpartial is designed for streams such as pipe, socket, tty, etc. It blocks only when no data immediately available. This means that it blocks only when following all conditions hold.
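To illustrate the difference, here is a minimal standard-library sketch of a read that gives up after a deadline instead of blocking. The helper readpartial_with_timeout and the ReadTimeoutError class are hypothetical names made up for this example; they are not httprb or GitLab code. The silent local server stands in for a primary that stops sending data.

```ruby
require "socket"
require "io/wait"

# Hypothetical error class for this sketch.
class ReadTimeoutError < StandardError; end

# Hypothetical helper: like readpartial, but raises if no data
# becomes readable within `timeout` seconds.
def readpartial_with_timeout(io, max_bytes, timeout)
  loop do
    result = io.read_nonblock(max_bytes, exception: false)
    case result
    when :wait_readable
      # wait_readable returns nil when the timeout expires.
      raise ReadTimeoutError, "no data within #{timeout}s" unless io.wait_readable(timeout)
    when nil
      raise EOFError, "end of file reached"
    else
      return result
    end
  end
end

# A local server that accepts the connection and then stays silent.
server = TCPServer.new("127.0.0.1", 0)
client = TCPSocket.new("127.0.0.1", server.addr[1])
peer = server.accept

timed_out = false
begin
  readpartial_with_timeout(client, 1024, 0.2)
rescue ReadTimeoutError
  timed_out = true # a plain client.readpartial(1024) would hang here
end

# Once the peer sends something, the same call returns the data.
peer.write("chunk")
data = readpartial_with_timeout(client, 1024, 0.2)

[client, peer, server].each(&:close)
```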

Steps to reproduce

I don't yet have a reliable way to reproduce this consistently.

What is the current bug behavior?

Requests for FileDownload from the Geo primary can get stuck, leading to multiple Sidekiq workers hanging.

What is the expected correct behavior?

We should have some form of read timeout so that a download does not block indefinitely when no data arrives or something goes wrong.

Relevant logs and/or screenshots

Most relevant backtrace from a Sidekiq thread dump with kill -TTIN:

/opt/gitlab/embedded/lib/ruby/2.6.0/openssl/buffering.rb:125:in `sysread'
/opt/gitlab/embedded/lib/ruby/2.6.0/openssl/buffering.rb:125:in `readpartial'
/opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/http-4.2.0/lib/http/timeout/null.rb:45:in `readpartial'
/opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/http-4.2.0/lib/http/connection.rb:212:in `read_more'
/opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/http-4.2.0/lib/http/connection.rb:92:in `readpartial'
/opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/http-4.2.0/lib/http/response/body.rb:30:in `readpartial'
/opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/http-4.2.0/lib/http/response/body.rb:36:in `each'
/opt/gitlab/embedded/service/gitlab-rails/ee/lib/gitlab/geo/replication/base_transfer.rb:137:in `download_file'
/opt/gitlab/embedded/service/gitlab-rails/ee/lib/gitlab/geo/replication/base_transfer.rb:64:in `download_from_primary'
/opt/gitlab/embedded/service/gitlab-rails/ee/lib/gitlab/geo/replication/job_artifact_downloader.rb:19:in `execute'
/opt/gitlab/embedded/service/gitlab-rails/ee/app/services/geo/file_download_service.rb:20:in `block in execute'
/opt/gitlab/embedded/service/gitlab-rails/app/services/concerns/exclusive_lease_guard.rb:29:in `try_obtain_lease'
/opt/gitlab/embedded/service/gitlab-rails/ee/app/services/geo/file_download_service.rb:17:in `execute'
/opt/gitlab/embedded/service/gitlab-rails/ee/app/workers/geo/file_download_worker.rb:11:in `perform'

Another symptom, though not the root cause, is the FileDownloadDispatchWorker looping without enqueuing anything further (presumably because the 10 scheduled jobs already fill the capacity of 10):

"message":"Loop 3496","enqueued":0,"pending":980,"scheduled":10,"capacity":10
"message":"Loop 3497","enqueued":0,"pending":980,"scheduled":10,"capacity":10
"message":"Loop 3498","enqueued":0,"pending":980,"scheduled":10,"capacity":10
"message":"Loop 3499","enqueued":0,"pending":980,"scheduled":10,"capacity":10
"message":"Loop 3500","enqueued":0,"pending":980,"scheduled":10,"capacity":10
"message":"Loop 3501","enqueued":0,"pending":980,"scheduled":10,"capacity":10

Possible fixes

httprb supports per-operation timeouts (connect, write and read), so that may be worth investigating here.

Edited by Catalin Irimie