
Investigate handleUploadPack: smarthttp.UploadPack broken pipe

Summary

I've been investigating this for a customer (internal only), but I'm finding it difficult to pinpoint an exact cause.

Git clones from the runner over HTTPS are failing during the get_sources stage, while git is exchanging packs with the server. The failure surfaces in handleUploadPack on the Workhorse side, presents itself as an error during /gitaly.SmartHTTPService/PostUploadPack on the Gitaly side, and results in a 500 back to nginx, with errno 104 (ECONNRESET) and errno 32 (EPIPE) on the underlying sockets.

I found a related, but old, issue on the gitaly project, as well as this (also old) issue on workhorse. There are 2.2 million logged events in Sentry for this error. No doubt many of them are easily explained by client issues, server misconfigurations, etc.

But in reviewing all of the information I have collected from the customer thus far, I cannot explain why it's happening, nor can I determine which side of the workhorse socket is hanging up first. This is happening on only one of their projects. They have runners on three separate machines (one on 12.10, two on 13.3), all of which experience the problem. The job fails approximately 900 seconds after the EPIPE error, which I assume is a response timeout from one of the involved services.

Of note, setting the variable GET_SOURCES_ATTEMPTS: 3 in their CI works around the issue, presumably because the runner simply retries the clone rather than giving up after the first attempt.
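For reference, the workaround amounts to wrapping the whole fetch in a retry loop. A minimal sketch of that behavior in Go (illustrative only, not the runner's actual implementation; the git command and attempt count are assumptions):

package main

import (
	"log"
	"os/exec"
)

// retryFetch mimics the effect of GET_SOURCES_ATTEMPTS: re-run the entire
// fetch when it fails, up to `attempts` times. Illustrative only -- the real
// runner re-executes its whole get_sources step.
func retryFetch(attempts int, args ...string) error {
	var err error
	for i := 1; i <= attempts; i++ {
		cmd := exec.Command("git", args...)
		cmd.Stdout = log.Writer()
		cmd.Stderr = log.Writer()
		if err = cmd.Run(); err == nil {
			return nil
		}
		log.Printf("attempt %d/%d failed: %v", i, attempts, err)
	}
	return err
}

func main() {
	// Hypothetical example: re-fetch all refs into an existing checkout.
	if err := retryFetch(3, "fetch", "--prune", "origin"); err != nil {
		log.Fatal(err)
	}
}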

Logs and data

To begin with, the git client (git/2.22.4) establishes an HTTP/2 connection over TLS with nginx, makes it through fetch-pack negotiation (want <ref> multi_ack_detailed no-done side-band-64k thin-pack no-progress include-tag ofs-delta deepen-since deepen-not agent=git/2.22.4), and sends a POST to /git-upload-pack:

18:29:48.832906 http.c:688              => Send header: POST /namespace/project.git/git-upload-pack HTTP/2
18:29:48.832911 http.c:688              => Send header: Host: <redacted>
18:29:48.832916 http.c:688              => Send header: Authorization: Basic <redacted>
18:29:48.832920 http.c:688              => Send header: User-Agent: git/2.22.4
18:29:48.832925 http.c:688              => Send header: Accept-Encoding: deflate, gzip
18:29:48.832930 http.c:688              => Send header: Content-Type: application/x-git-upload-pack-request
18:29:48.832935 http.c:688              => Send header: Accept: application/x-git-upload-pack-result
18:29:48.832940 http.c:688              => Send header: Content-Encoding: gzip
18:29:48.832944 http.c:688              => Send header: Content-Length: 4696
...
...
18:29:48.834612 http.c:729              == Info: We are completely uploaded and fine

The server replies with a 200:

18:29:48.891611 http.c:676              <= Recv header, 0000000013 bytes (0x0000000d)
18:29:48.891616 http.c:688              <= Recv header: HTTP/2 200
18:29:48.891620 http.c:676              <= Recv header, 0000000015 bytes (0x0000000f)
18:29:48.891624 http.c:688              <= Recv header: server: nginx
18:29:48.891628 http.c:676              <= Recv header, 0000000037 bytes (0x00000025)
18:29:48.891631 http.c:688              <= Recv header: date: Sat, 15 Aug 2020 18:29:48 GMT
18:29:48.891636 http.c:676              <= Recv header, 0000000052 bytes (0x00000034)
18:29:48.891639 http.c:688              <= Recv header: content-type: application/x-git-upload-pack-result
18:29:48.891643 http.c:676              <= Recv header, 0000000025 bytes (0x00000019)
18:29:48.891647 http.c:688              <= Recv header: cache-control: no-cache
18:29:48.891651 http.c:676              <= Recv header, 0000000045 bytes (0x0000002d)
18:29:48.891654 http.c:688              <= Recv header: strict-transport-security: max-age=31536000
18:29:48.891658 http.c:676              <= Recv header, 0000000050 bytes (0x00000032)
18:29:48.891661 http.c:688              <= Recv header: referrer-policy: strict-origin-when-cross-origin
18:29:48.891666 http.c:676              <= Recv header, 0000000002 bytes (0x00000002)

and begins transmitting pack data. Client-side git runs this command a few times while receiving it:

18:29:48.892997 run-command.c:663       trace: run_command: git index-pack --stdin --fix-thin '--keep=fetch-pack 22 on runner-dvgbvec2-project-14-concurrent-6' --pack_header=2,5989
...
...
18:29:48.894291 git.c:439               trace: built-in: git index-pack --stdin --fix-thin '--keep=fetch-pack 22 on runner-dvgbvec2-project-14-concurrent-6' --pack_header=2,5989

Beginning at timestamp 18:29:48.894504, the client is only receiving data over TLS. The server keeps sending data until this moment:

18:30:48.551922 http.c:717              <= Recv data: a8......
18:30:48.551943 http.c:702              <= Recv SSL data, 0000000005 bytes (0x00000005)
18:30:48.551949 http.c:717              <= Recv SSL data: .....

^ Those 5 bytes signify that another chunk should be on its way (each chunk begins with a CRLF, 5 bytes long), but nothing further happens for approximately 10 minutes, at which point:

18:40:59.768679 http.c:729              == Info: OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 110
18:40:59.768704 http.c:729              == Info: Failed receiving HTTP2 data
18:40:59.768735 http.c:702              => Send SSL data, 0000000005 bytes (0x00000005)
18:40:59.768740 http.c:717              => Send SSL data: .....
18:40:59.768746 http.c:702              => Send SSL data, 0000000001 bytes (0x00000001)
18:40:59.768750 http.c:717              => Send SSL data: .
18:40:59.768760 http.c:729              == Info: SSL_write() returned SYSCALL, errno = 32
18:40:59.768765 http.c:729              == Info: Failed sending HTTP2 data
18:40:59.768777 http.c:729              == Info: Connection #0 to host <redacted> left intact
error: RPC failed; curl 56 OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 110
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

The above looks to indicate that the ECONNRESET (connection reset by peer) is not occurring on the git client side; errno 110 is ETIMEDOUT, so the client's connection eventually just times out waiting for more data. It is still possible, though, that nginx is sending an RST.
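For reference when reading these errors from the Go side (workhorse and the runner are Go programs; the git client is C, but the errnos are the same), the three errnos in play map onto syscall constants that can be checked with errors.Is. A minimal sketch (my own illustration, not workhorse code):

package main

import (
	"errors"
	"fmt"
	"syscall"
)

// classify names the three errnos that show up in this investigation:
// errno 110 (ETIMEDOUT) on the git client, errno 104 (ECONNRESET) on
// workhorse's read, and errno 32 (EPIPE) on workhorse's write.
func classify(err error) string {
	switch {
	case errors.Is(err, syscall.ETIMEDOUT):
		return "ETIMEDOUT (110): the peer went silent and the kernel gave up"
	case errors.Is(err, syscall.ECONNRESET):
		return "ECONNRESET (104): the peer reset / discarded our data"
	case errors.Is(err, syscall.EPIPE):
		return "EPIPE (32): we wrote after the peer closed its end"
	default:
		return fmt.Sprintf("other: %v", err)
	}
}

func main() {
	// errors.Is unwraps *net.OpError and *os.SyscallError, so the same
	// check works on errors returned by net.Conn Read/Write.
	fmt.Println(classify(syscall.ECONNRESET))
}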

On the workhorse side, I had the customer attach strace to workhorse in hopes of getting more data on where the hangup happens. Timestamps are different from the above because a different pipeline run was traced.

Starting from the time at which workhorse writes the error to its log file:

05:53:15.088961 write(2<pipe:[15757]>, "{\"correlation_id\":\"vFxe8sG0oO8\",\"error\":\"handleUploadPack: smarthttp.UploadPack: write unix /var/opt/gitlab/gitlab-workhorse/socket-\\u003e@: write: broken pipe\",\"level\":\"error\",\"method\":\"POST\",\"msg\":\"error\",\"time\":\"2020-08-25T05:53:15Z\",\"uri\":\"/namespace/project.git/git-upload-pack\"}\n", 294) = 294 <0.000154>

We can work backwards to first locate the EPIPE:

write(9<UNIX:[2085213623->2085216300,"/var/opt/gitlab/gitlab-workhorse/socket"]>, "\r\n8000\r\n[TLS-ENCRYPTED DATA GOES HERE]"..., 4096) = -1 EPIPE (Broken pipe) <0.000019>
05:53:15.087771 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=31332, si_uid=998} ---
05:53:15.087819 rt_sigreturn({mask=[]}) = -1 EPIPE (Broken pipe) <0.000075>
05:53:15.087944 epoll_ctl(5<anon_inode:[eventpoll]>, EPOLL_CTL_DEL, 9<UNIX:[2085213623->2085216300,"/var/opt/gitlab/gitlab-workhorse/socket"]>, 0xc0013b3734) = 0 <0.000022>
05:53:15.087999 close(9<UNIX:[2085213623->2085216300,"/var/opt/gitlab/gitlab-workhorse/socket"]>) = 0 <0.000016>

Which correlates with this read() directly before it:

05:53:15.087664 read(9<UNIX:[2085213623->2085216300,"/var/opt/gitlab/gitlab-workhorse/socket"]>, 0xc0004919f1, 1) = -1 ECONNRESET (Connection reset by peer) <0.000047>
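As an aside on the \r\n8000\r\n prefix in the failed write() above: the response body workhorse streams back to nginx is chunked, so each 0x8000-byte (32 KiB) slab of pack data is framed by a hex length line, and a write that lands on a chunk boundary begins with the previous chunk's trailing CRLF followed by the next length line. A minimal sketch of that framing using the standard library's chunked writer (my illustration, not workhorse code):

package main

import (
	"bytes"
	"fmt"
	"net/http/httputil"
)

func main() {
	var buf bytes.Buffer
	cw := httputil.NewChunkedWriter(&buf)

	// Two 0x8000-byte (32 KiB) chunks, like the pack data workhorse was
	// streaming back to nginx when the write() failed.
	chunk := make([]byte, 0x8000)
	cw.Write(chunk)
	cw.Write(chunk)
	cw.Close() // writes the terminating zero-length chunk

	// Wire format: "8000\r\n<data>\r\n8000\r\n<data>\r\n0\r\n". A write
	// that starts between chunks therefore begins with "\r\n8000\r\n",
	// exactly the prefix visible in the strace above.
	fmt.Printf("%q\n", buf.Bytes()[:10])
}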

Description of UDS connection 2085213623->2085216300:

05:53:07.107755 accept4(3<UNIX:[1546455873,"/var/opt/gitlab/gitlab-workhorse/socket"]>, {sa_family=AF_UNIX}, [112->2], SOCK_CLOEXEC|SOCK_NONBLOCK) = 9<UNIX:[2085213623->2085216300,"/var/opt/gitlab/gitlab-workhorse/socket"]> <0.000023>
05:53:07.108231 epoll_ctl(5<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 9<UNIX:[2085213623->2085216300,"/var/opt/gitlab/gitlab-workhorse/socket"]>, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=136249968, u64=140664610161264}}) = 0 <0.000015>
05:53:07.108280 getsockname(9<UNIX:[2085213623->2085216300,"/var/opt/gitlab/gitlab-workhorse/socket"]>, {sa_family=AF_UNIX, sun_path="/var/opt/gitlab/gitlab-workhorse/socket"}, [112->42]) = 0 <0.000016>

And judging from the following read() over that socket, the request on the other end originates from gitlab-runner's git client, proxied through nginx (hence the X-Real-IP / X-Forwarded-* headers):

05:53:07.108488 read(9<UNIX:[2085213623->2085216300,"/var/opt/gitlab/gitlab-workhorse/socket"]>, "POST /namespace/project.git/git-upload-pack HTTP/1.1\r\nHost: <redacted>\r\nX-Real-IP: x.x.x.x\r\nX-Forwarded-For: x.x.x.x\r\nConnection: close\r\nX-Forwarded-Proto: https\r\nX-Forwarded-Ssl: on\r\nContent-Length: 4920\r\nauthorization: Basic <redacted>\r\nuser-agent: git/2.22.4\r\naccept-encoding: deflate, gzip\r\ncontent-type: application/x-git-upload-pack-request\r\naccept: application/x-git-upload-pack-result\r\ncontent-encoding: gzip\r\n\r\n[TLS-ENCRYPTED DATA GOES HERE]"..., 4096) = 4096 <0.000012>
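That also means nginx, not the runner, is the direct peer on the workhorse socket. For what it's worth, that unix-socket hop can be exercised in isolation with an HTTP client that dials the socket directly, which may help establish whether nginx is the side hanging up. A hedged sketch (the socket path is the one from the strace; the URL and the missing headers are assumptions):

package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
)

func main() {
	// HTTP client whose transport dials the workhorse unix socket instead
	// of TCP, bypassing nginx entirely.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix",
					"/var/opt/gitlab/gitlab-workhorse/socket")
			},
		},
	}

	// The host part of the URL is ignored by the dialer above. A real
	// request would also need the auth and X-Forwarded-* headers that
	// nginx normally injects.
	resp, err := client.Get("http://workhorse/namespace/project.git/info/refs?service=git-upload-pack")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}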

Approximately 10 minutes later, the runner writes back to workhorse to update the job log:

06:03:27.989667 read(10<UNIX:[2085398772->2085395106,"/var/opt/gitlab/gitlab-workhorse/socket"]>, "PATCH /api/v4/jobs/315155/trace HTTP/1.1\r\nHost: <redacted>\r\nX-Real-IP: x.x.x.x\r\nX-Forwarded-For: x.x.x.x\r\nConnection: close\r\nX-Forwarded-Proto: https\r\nX-Forwarded-Ssl: on\r\nContent-Length: 159\r\nUser-Agent: gitlab-runner 12.10.2 (12-10-stable; go1.13.8; linux/amd64)\r\nContent-Range: 1257-1415\r\nContent-Type: text/plain\r\nJob-Token: 7Kbha5x9qtzGPSqf3Pgs\r\nAccept-Encoding: gzip\r\n\r\nerror: RPC failed; curl 56 OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 110\nfatal: the remote end hung up unexpectedly\nfatal: early EOF\nfatal: index-pack failed\n", 4096) = 557 <0.000017>

And workhorse sends a final reply:

06:03:31.410050 write(10<UNIX:[2085398877->2085393352,"/var/opt/gitlab/gitlab-workhorse/socket"]>, "HTTP/1.1 200 OK\r\nCache-Control: max-age=0, private, must-revalidate\r\nContent-Type: application/json; charset=utf-8\r\nEtag: W/\"00b536d10d0c9550e7a8d5bda1208623\"\r\nPage-Title: GitLab\r\nPragma: no-cache\r\nReferrer-Policy: strict-origin-when-cross-origin\r\nX-Content-Type-Options: nosniff\r\nX-Download-Options: noopen\r\nX-Frame-Options: DENY\r\nX-Permitted-Cross-Domain-Policies: none\r\nX-Request-Id: qgYaP5fXGR3\r\nX-Runtime: 0.029882\r\nX-Ua-Compatible: IE=edge\r\nX-Xss-Protection: 1; mode=block\r\nDate: Tue, 25 Aug 2020 06:03:31 GMT\r\nContent-Length: 958\r\nConnection: close\r\n\r\n{\"id\":315155,\"status\":\"running\",\"complete\":false,\"state\":\"eyJvZmZzZXQiOjE1NjMsInN0eWxlIjp7ImZnIjpudWxsLCJiZyI6bnVsbCwibWFzayI6MH0sIm9wZW5fc2VjdGlvbnMiOnsidXBsb2FkLWFydGlmYWN0cy1vbi1mYWlsdXJlIjoiMTU5ODMzNTQwOSJ9fQ==\",\"append\":true,\"truncated\":false,\"offset\":1257,\"size\":311,\"total\":1568,\"lines\":[{\"offset\":1257,\"content\":[{\"text\":\"error: RPC failed; curl 56 OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 110\"}],\"section\":\"get-sources\"},{\"offset\":1331,\"content\":[{\"text\""..., 1517) = 1517 <0.000017>

At a high level, workhorse is sending and receiving over its domain socket without issue until an unexpected hangup occurs, aborting the upload-pack request. It seems possible that nginx is resetting the connection, or closing its end of the socket prematurely, but no errors are logged by nginx.
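If I'm reading the AF_UNIX semantics correctly, a read() returning ECONNRESET (rather than a plain EOF) on a unix stream socket means the peer closed its end while data we had written was still sitting unread in its queue, which would fit nginx closing its side while workhorse was still streaming pack data. A Linux-only sketch that reproduces the same read/write error pair seen in the strace (my illustration of the socket behavior, not workhorse code):

package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
)

func main() {
	sock := filepath.Join(os.TempDir(), "epipe-demo.sock")
	os.Remove(sock)
	ln, err := net.Listen("unix", sock)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	sent := make(chan struct{})
	closed := make(chan struct{})
	go func() {
		// "nginx" side: connect, never read anything, then close while
		// the peer's data is still queued.
		c, err := net.Dial("unix", sock)
		if err != nil {
			panic(err)
		}
		<-sent
		c.Close()
		close(closed)
	}()

	conn, err := ln.Accept() // "workhorse" side
	if err != nil {
		panic(err)
	}
	conn.Write(make([]byte, 32<<10)) // pack data left unread by the peer
	close(sent)
	<-closed

	buf := make([]byte, 1)
	_, rerr := conn.Read(buf)
	_, werr := conn.Write(buf)
	fmt.Println("read: ", rerr) // on Linux: connection reset by peer
	fmt.Println("write:", werr) // on Linux: broken pipe
}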

Gitaly records a cancelled RPC, but this appears to be a symptom more than anything else (timestamps are different from above):

{"correlation_id":"TI6MnBfl8Ca","error":"rpc error: code = Canceled desc = rpc error: code = Unavailable desc = PostUploadPack: signal: terminated","grpc.code":"Canceled","grpc.meta.auth_version":"v2","grpc.meta.deadline_type":"none","grpc.method":"PostUploadPack","grpc.request.fullMethod":"/gitaly.SmartHTTPService/PostUploadPack","grpc.request.glProjectPath":"namespace/project","grpc.request.glRepository":"project-14","grpc.request.repoPath":"namespace/project.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"namespace","grpc.service":"gitaly.SmartHTTPService","grpc.start_time":"2020-08-06T13:07:18Z","grpc.time_ms":4455.213,"level":"info","msg":"finished streaming call with code Canceled","peer.address":"@","pid":7401,"span.kind":"server","system":"grpc","time":"2020-08-06T13:07:23.435Z"}

Other things

  • Through multiple gitlabsos reports and the customer's own investigation, I've ruled out disk space, memory, and CPU as factors
  • there are no errors logged by nginx
  • we have confirmed this happens with two different major versions of gitlab-runner (12.10 and 13.3)
  • nginx is running with default settings
  • no errors are reported by the runner, even in debug mode

Omnibus GitLab, 13.0.6-ee at start, upgraded to 13.3.1-ee

version-manifest.json is attached

The support ticket has numerous attachments, including the full strace archive (internal only)

I realized after writing this up that the issue may be more appropriate on the main GitLab project; feel free to move it there if you agree. I originally chose the workhorse project because workhorse provides the most obvious error messaging.