I/O timeout when pushing LFS objects
Summary
We're still seeing some `i/o timeout` errors when uploading LFS objects on GitLab.com. We had a similar experience in this issue where fog-aws was patched, and we've since seen fewer reports of LFS upload issues on GitLab.com. However, it's still possible to reproduce some timeout problems. In my reproduction, I was able to upload a 1 GB .ISO object successfully the first time. I then deleted it from my repo and local working copy and tried to upload it again. On the second try, I hit an `i/o timeout` and the upload took significantly longer.
```plaintext
10:08:04.914971 trace git-lfs: HTTP: PUT https://gitlab.com/gitlab-gold/cbledsoe-test/lfs-test.git/gitlab-lfs/objects/a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe/1073741824
10:12:55.958684 trace git-lfs: tq: retrying object a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe: LFS: Put "https://gitlab.com/gitlab-gold/cbledsoe-test/lfs-test.git/gitlab-lfs/objects/a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe/1073741824": read tcp 192.168.86.42:59291->172.65.251.78:443: i/o timeout
10:12:55.961220 trace git-lfs: tq: enqueue retry #2 after 0.50s for "a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe" (size: 1073741824)
10:12:56.463139 trace git-lfs: tq: sending batch of size 1
10:12:56.464339 trace git-lfs: ssh cache: git@gitlab.com git-lfs-authenticate gitlab-gold/cbledsoe-test/lfs-test.git upload
10:12:56.464499 trace git-lfs: api: batch 1 files
10:12:56.464662 trace git-lfs: HTTP: POST https://gitlab.com/gitlab-gold/cbledsoe-test/lfs-test.git/info/lfs/objects/batch
10:12:56.825075 trace git-lfs: HTTP: 200
10:12:56.825728 trace git-lfs: HTTP: {"objects":[{"oid":"a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe","size":1073741824,"actions":{"upload":{"href":"https://gitlab.com/gitlab-gold/cbledsoe-test/lfs-test.git/gitlab-lfs/objects/a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe/1073741824","header":{"Authorization":"Basic Y2xldmVsYW5kOmV5SmhiR2NpT2lKSVV6STFOaUlzSW5SNWNDSTZJa3BYVkNKOS5leUprWVhSaElqcDdJbUZqZEc5eUlqb2lZMnhsZG1Wc1lXNWtJbjBzSW1wMGFTSTZJamxtWmpjNVpXUTBMVFEzTlRNdE5EWm1PQzFpWWprd0xUa3dNelJoTXpOaU9UY3dN
10:12:56.825755 trace git-lfs: HTTP: aUlzSW1saGRDSTZNVFkyTVRRME1ETTNOaXdpYm1KbUlqb3hOall4TkRRd016Y3hMQ0psZUhBaU9qRTJOakUwTkRjMU56WjkubHhWRVpISWoyQUFUMVJWanRYSWNfbEhxaHd0RHdTME54STBqWWs3WmZiRQ==","Content-Type":"application/octet-stream","Transfer-Encoding":"chunked"}}}}]}
10:12:56.826491 trace git-lfs: tq: starting transfer adapter "basic"
10:12:56.828628 trace git-lfs: HTTP: PUT https://gitlab.com/gitlab-gold/cbledsoe-test/lfs-test.git/gitlab-lfs/objects/a091aba9553e4c171c5b4a7957bf6fe91ecfb9818ec74b286acd696624d27fbe/1073741824
10:17:32.629218 trace git-lfs: HTTP: 200B | 4.1 MB/s
Uploading LFS objects: 100% (1/1), 3.2 GB | 4.1 MB/s, done.
```
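The `read tcp ... i/o timeout` at 10:12:55 is the error Go's `net` package returns when a read deadline expires. If I understand the client side correctly, git-lfs enforces an activity timeout on its connections (the `lfs.activitytimeout` setting), so a long enough quiet period from the server fails the transfer even though bytes were flowing earlier. A minimal sketch of that mechanism, not git-lfs's actual code:

```go
package lfstimeout

import (
	"net"
	"time"
)

// activityConn wraps a net.Conn and pushes the read deadline forward on
// every Read. If the peer sends nothing for longer than timeout, the next
// Read fails with a net.Error whose message ends in "i/o timeout", the
// same text that shows up in the git-lfs trace above.
type activityConn struct {
	net.Conn
	timeout time.Duration
}

func (c *activityConn) Read(p []byte) (int, error) {
	if err := c.Conn.SetReadDeadline(time.Now().Add(c.timeout)); err != nil {
		return 0, err
	}
	return c.Conn.Read(p)
}
```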
Looking at the Workhorse logs, we can see a `context canceled` error which seems to line up with the retries:
```plaintext
RequestBody: upload failed: PUT request "https://storage.googleapis.com/gitlab-gprd-lfs-objects/tmp/uploads/1661440080-54-0001-6169-137c9466a64fb57fa4035b0963548564?GoogleAccessId=gitlab-object-storage-prd@gitlab-production.iam.gserviceaccount.com&Signature=[FILTERED]&Expires=1661455380": Put "https://storage.googleapis.com/gitlab-gprd-lfs-objects/tmp/uploads/1661440080-54-0001-6169-137c9466a64fb57fa4035b0963548564?GoogleAccessId=gitlab-object-storage-prd@gitlab-production.iam.gserviceaccount.com&Signature=1jzTVD%2FtXy5b%2BK%2Fe2n%2FP%2Bt16iUOFLfWgVwFtURWb%2BlYREaK%2FFL%2BigOGVO4MT%0AlZ9xljkjOA18NirOKmlIdr8ossKXvDuM3MVN1uoI%2BY2wEdO4dafJ3gXoUzeD%0ANr%2FO2pwSHODpJ7kvtxUOrJ3RlHxL5dNKv0lSum2TKTcZF6rAkB3uLmAgaBDN%0A5Bk1YP5g3MI6oyKsB7JkPAajLpL4oS27mt45NDCPmB6sJPwFfkEb9HJgHOjB%0AQIQj4jg%2BTxDJ9qyV6nbmT25ZCcJHSbBFkwwurOSG2XwpNjwXxsPJYwAcogBr%0A2LmsYs9M9JKFkFRZNB3ireoQOUF1PjxzNKkTmLkAgg%3D%3D&Expires=1661455380": context canceled
```
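This is consistent with how a Go proxy behaves when its client goes away: if the upstream PUT is issued with the inbound request's context, a client disconnect (here, git-lfs timing out and starting a retry) cancels that context and aborts the in-flight transfer to object storage. A hypothetical sketch of the pattern, not Workhorse's actual code (`putObject` and `signedURL` are invented names):

```go
package proxy

import (
	"io"
	"net/http"
)

// putObject streams the inbound request body to a pre-signed object
// storage URL. Because the upstream request shares r.Context(), the PUT
// is canceled the moment the LFS client drops the connection, which
// surfaces as the `Put "...": context canceled` seen in the log above.
func putObject(w http.ResponseWriter, r *http.Request, signedURL string) {
	req, err := http.NewRequestWithContext(r.Context(), http.MethodPut, signedURL, r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	req.ContentLength = r.ContentLength

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// A canceled context lands here mid-upload.
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	w.WriteHeader(resp.StatusCode)
}
```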
Looking more broadly, we do see some of these errors occasionally. I was able to observe the same behavior for a customer who was experiencing `i/o timeout` errors as well:
- 500 Errors in workhorse (Internal)
- Correlation Dashboard for one request (Internal)
Work in the following issues may help improve this behavior, since we currently upload to temporary GCS storage and then copy from it (a sketch of the composite-upload approach follows the list):
- Implement parallel composite uploads for Google Cloud Storage: #296931
- Make object storage finalize copy to a background job: #296930
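For context on the first issue, here's a rough sketch of what a parallel composite upload could look like with the `cloud.google.com/go/storage` client. It's an illustration under my own assumptions (`uploadComposite`, the part naming, and the chunking are invented; note that GCS compose accepts at most 32 source objects per call, and cleanup of the temporary parts is omitted):

```go
package upload

import (
	"context"
	"fmt"
	"io"
	"os"

	"cloud.google.com/go/storage"
	"golang.org/x/sync/errgroup"
)

// uploadComposite splits src into chunkSize pieces, uploads them
// concurrently, then stitches them into one object server-side.
func uploadComposite(ctx context.Context, bkt *storage.BucketHandle, name string, src *os.File, size, chunkSize int64) error {
	g, ctx := errgroup.WithContext(ctx)

	var parts []*storage.ObjectHandle
	for off := int64(0); off < size; off += chunkSize {
		off := off
		n := chunkSize
		if off+n > size {
			n = size - off
		}
		part := bkt.Object(fmt.Sprintf("%s.part-%d", name, off/chunkSize))
		parts = append(parts, part)

		g.Go(func() error {
			w := part.NewWriter(ctx)
			// Each goroutine reads its own window of the file.
			if _, err := io.Copy(w, io.NewSectionReader(src, off, n)); err != nil {
				w.Close()
				return err
			}
			return w.Close()
		})
	}
	if err := g.Wait(); err != nil {
		return err
	}

	// Compose the parts into the final object without re-downloading them.
	_, err := bkt.Object(name).ComposerFrom(parts...).Run(ctx)
	return err
}
```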
What is the current bug behavior?
Uploading LFS objects to a repository occasionally results in an `i/o timeout`, and the upload takes significantly longer to complete.
What is the expected correct behavior?
Uploading LFS objects should not fail or result in an `i/o timeout` with multiple retries.
Output of checks
GitLab.com 15.4.0-pre b0d2919a
ZD Tickets:
- https://gitlab.zendesk.com/agent/tickets/315881
- https://gitlab.zendesk.com/agent/tickets/343641
- https://gitlab.zendesk.com/agent/tickets/405627