I/O timeout when pushing LFS objects
Summary
We're still seeing some `i/o timeout` errors when uploading LFS objects on GitLab.com. We had a similar experience in this issue where fog-aws was patched, and we've since seen fewer reports of LFS upload issues on GitLab.com. However, it's still possible to reproduce some timeout problems. In my reproduction, I was able to upload a 1 GB .ISO object successfully the first time. I then deleted it from my repo and local working copy and tried to upload it again. On the second try, I hit an `i/o timeout` and the upload took significantly longer.
```plaintext
10:08:04.914971 trace git-lfs: HTTP: PUT https://gitlab.com/gitlab-gold/cbledsoe-test/lfs-test.git/gitlab-lfs/objects/a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe/1073741824
10:12:55.958684 trace git-lfs: tq: retrying object a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe: LFS: Put "https://gitlab.com/gitlab-gold/cbledsoe-test/lfs-test.git/gitlab-lfs/objects/a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe/1073741824": read tcp 192.168.86.42:59291->172.65.251.78:443: i/o timeout
10:12:55.961220 trace git-lfs: tq: enqueue retry #2 after 0.50s for "a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe" (size: 1073741824)
10:12:56.463139 trace git-lfs: tq: sending batch of size 1
10:12:56.464339 trace git-lfs: ssh cache: git@gitlab.com git-lfs-authenticate gitlab-gold/cbledsoe-test/lfs-test.git upload
10:12:56.464499 trace git-lfs: api: batch 1 files
10:12:56.464662 trace git-lfs: HTTP: POST https://gitlab.com/gitlab-gold/cbledsoe-test/lfs-test.git/info/lfs/objects/batch
10:12:56.825075 trace git-lfs: HTTP: 200
10:12:56.825728 trace git-lfs: HTTP: {"objects":[{"oid":"a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe","size":1073741824,"actions":{"upload":{"href":"https://gitlab.com/gitlab-gold/cbledsoe-test/lfs-test.git/gitlab-lfs/objects/a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe/1073741824","header":{"Authorization":"Basic Y2xldmVsYW5kOmV5SmhiR2NpT2lKSVV6STFOaUlzSW5SNWNDSTZJa3BYVkNKOS5leUprWVhSaElqcDdJbUZqZEc5eUlqb2lZMnhsZG1Wc1lXNWtJbjBzSW1wMGFTSTZJamxtWmpjNVpXUTBMVFEzTlRNdE5EWm1PQzFpWWprd0xUa3dNelJoTXpOaU9UY3dN
10:12:56.825755 trace git-lfs: HTTP: aUlzSW1saGRDSTZNVFkyTVRRME1ETTNOaXdpYm1KbUlqb3hOall4TkRRd016Y3hMQ0psZUhBaU9qRTJOakUwTkRjMU56WjkubHhWRVpISWoyQUFUMVJWanRYSWNfbEhxaHd0RHdTME54STBqWWs3WmZiRQ==","Content-Type":"application/octet-stream","Transfer-Encoding":"chunked"}}}}]}
10:12:56.826491 trace git-lfs: tq: starting transfer adapter "basic"
10:12:56.828628 trace git-lfs: HTTP: PUT https://gitlab.com/gitlab-gold/cbledsoe-test/lfs-test.git/gitlab-lfs/objects/a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe/1073741824
10:17:32.629218 trace git-lfs: HTTP: 200B | 4.1 MB/s
Uploading LFS objects: 100% (1/1), 3.2 GB | 4.1 MB/s, done.
```
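The `read tcp ... i/o timeout` at 10:12:55 is the error Go's `net` package returns when a read deadline expires. If I understand the client side correctly, git-lfs enforces an activity timeout on its connections (the `lfs.activitytimeout` setting), so a long enough quiet period from the server fails the transfer even though bytes were flowing earlier. A minimal sketch of that mechanism, not git-lfs's actual code:

```go
package lfstimeout

import (
	"net"
	"time"
)

// activityConn wraps a net.Conn and pushes the read deadline forward on
// every Read. If the peer sends nothing for longer than timeout, the next
// Read fails with a net.Error whose message ends in "i/o timeout", the
// same text that shows up in the git-lfs trace above.
type activityConn struct {
	net.Conn
	timeout time.Duration
}

func (c *activityConn) Read(p []byte) (int, error) {
	if err := c.Conn.SetReadDeadline(time.Now().Add(c.timeout)); err != nil {
		return 0, err
	}
	return c.Conn.Read(p)
}
```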
Looking at the Workhorse logs, we can see a `context canceled` error which seems to line up with the retries:
```plaintext
RequestBody: upload failed: PUT request "https://storage.googleapis.com/gitlab-gprd-lfs-objects/tmp/uploads/1661440080-54-0001-6169-137c9466a64fb57fa4035b0963548564?GoogleAccessId=gitlab-object-storage-prd@gitlab-production.iam.gserviceaccount.com&Signature=[FILTERED]&Expires=1661455380": Put "https://storage.googleapis.com/gitlab-gprd-lfs-objects/tmp/uploads/1661440080-54-0001-6169-137c9466a64fb57fa4035b0963548564?GoogleAccessId=gitlab-object-storage-prd@gitlab-production.iam.gserviceaccount.com&Signature=1jzTVD%2FtXy5b%2BK%2Fe2n%2FP%2Bt16iUOFLfWgVwFtURWb%2BlYREaK%2FFL%2BigOGVO4MT%0AlZ9xljkjOA18NirOKmlIdr8ossKXvDuM3MVN1uoI%2BY2wEdO4dafJ3gXoUzeD%0ANr%2FO2pwSHODpJ7kvtxUOrJ3RlHxL5dNKv0lSum2TKTcZF6rAkB3uLmAgaBDN%0A5Bk1YP5g3MI6oyKsB7JkPAajLpL4oS27mt45NDCPmB6sJPwFfkEb9HJgHOjB%0AQIQj4jg%2BTxDJ9qyV6nbmT25ZCcJHSbBFkwwurOSG2XwpNjwXxsPJYwAcogBr%0A2LmsYs9M9JKFkFRZNB3ireoQOUF1PjxzNKkTmLkAgg%3D%3D&Expires=1661455380": context canceled
```
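This is consistent with how a Go proxy behaves when its client goes away: if the upstream PUT is issued with the inbound request's context, a client disconnect (here, git-lfs timing out and starting a retry) cancels that context and aborts the in-flight transfer to object storage. A hypothetical sketch of the pattern, not Workhorse's actual code (`putObject` and `signedURL` are invented names):

```go
package proxy

import (
	"io"
	"net/http"
)

// putObject streams the inbound request body to a pre-signed object
// storage URL. Because the upstream request shares r.Context(), the PUT
// is canceled the moment the LFS client drops the connection, which
// surfaces as the `Put "...": context canceled` seen in the log above.
func putObject(w http.ResponseWriter, r *http.Request, signedURL string) {
	req, err := http.NewRequestWithContext(r.Context(), http.MethodPut, signedURL, r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	req.ContentLength = r.ContentLength

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// A canceled context lands here mid-upload.
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	w.WriteHeader(resp.StatusCode)
}
```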
Looking more broadly, we do see some of these errors occasionally. I was able to observe the same behavior for a customer who was experiencing `i/o timeout` errors as well:
- 500 Errors in workhorse (Internal)
- Correlation Dashboard for one request (Internal)
Work in the following issues may help improve this behavior, since we currently upload to temporary GCS storage and then copy from it (a sketch of the composite-upload approach follows the list):
- Implement parallel composite uploads for Google Cloud Storage: #296931
- Make object storage finalize copy to a background job: #296930
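For context on the first issue, here's a rough sketch of what a parallel composite upload could look like with the `cloud.google.com/go/storage` client. It's an illustration under my own assumptions (`uploadComposite`, the part naming, and the chunking are invented; note that GCS compose accepts at most 32 source objects per call, and cleanup of the temporary parts is omitted):

```go
package upload

import (
	"context"
	"fmt"
	"io"
	"os"

	"cloud.google.com/go/storage"
	"golang.org/x/sync/errgroup"
)

// uploadComposite splits src into chunkSize pieces, uploads them
// concurrently, then stitches them into one object server-side.
func uploadComposite(ctx context.Context, bkt *storage.BucketHandle, name string, src *os.File, size, chunkSize int64) error {
	g, ctx := errgroup.WithContext(ctx)

	var parts []*storage.ObjectHandle
	for off := int64(0); off < size; off += chunkSize {
		off := off
		n := chunkSize
		if off+n > size {
			n = size - off
		}
		part := bkt.Object(fmt.Sprintf("%s.part-%d", name, off/chunkSize))
		parts = append(parts, part)

		g.Go(func() error {
			w := part.NewWriter(ctx)
			// Each goroutine reads its own window of the file.
			if _, err := io.Copy(w, io.NewSectionReader(src, off, n)); err != nil {
				w.Close()
				return err
			}
			return w.Close()
		})
	}
	if err := g.Wait(); err != nil {
		return err
	}

	// Compose the parts into the final object without re-downloading them.
	_, err := bkt.Object(name).ComposerFrom(parts...).Run(ctx)
	return err
}
```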
What is the current bug behavior?
Uploading LFS objects to a repository occasionally results in an `i/o timeout`, and the upload takes significantly longer to complete.
What is the expected correct behavior?
Uploading LFS objects should not fail or result in an `i/o timeout` with multiple retries.
Output of checks
GitLab.com 15.4.0-pre b0d2919a
ZD Tickets:
- https://gitlab.zendesk.com/agent/tickets/315881
- https://gitlab.zendesk.com/agent/tickets/343641
- https://gitlab.zendesk.com/agent/tickets/405627