Puma UNIX socket saturation through Workhorse on new installs
Summary
Multiple concurrent requests to Puma through Workhorse on a default (new) 13.4 install get quickly saturated and Workhorse starts throwing "Resource temporarily unavailable" errors when connecting to the Puma socket.
Large Premium Customer (ZD - internal only) reported getting 502 errors for internal/authorized_keys
endpoint after a 13.4.2 upgrade, Workhorse logs show the "Resource temporarily unavailable" errors, this coincided with !4498 (merged) that changed the gitlab-shell communication to go through Workhorse.
A Puma issue seems related intially, that resulted in Puma now defaulting to a backlog of 1024, but there are also reports that it's still limited by the 128 somaxconn default.
Few local tests on a new 13.4 Ubuntu 18.04 instance:
-
Prerequisite:
TOKEN=$(cat /opt/gitlab/embedded/service/gitlab-rails/.gitlab_shell_secret | base64 -w0) KEY=$(cat /var/opt/gitlab/.ssh/authorized_keys | tail -1 | awk '{print $4}') echo "gitlab_workhorse['listen_network'] = 'tcp'" >> /etc/gitlab/gitlab.rb echo "gitlab_workhorse['listen_addr'] = 'localhost:1234'" >> /etc/gitlab/gitlab.rb gitlab-ctl reconfigure
-
128 somaxconn and Workhorse to Puma through TCP:
echo 128 > /proc/sys/net/core/somaxconn # in /etc/gitlab/gitlab.rb gitlab_workhorse['auth_socket'] = nil gitlab_workhorse['auth_backend'] = 'http://localhost:8080' gitlab-ctl reconfigure gitlab-ctl restart # for the somaxconn change
Load-testing:
echo > /var/log/gitlab/gitlab-workhorse/current echo > /var/log/gitlab/gitlab-rails/api_json.log # results root@cat:~# wrk -t12 -c200 -d5s http://127.0.0.1:1234/api/v4/internal/authorized_keys?key=$KEY -H "Gitlab-Shared-Secret: $TOKEN" Running 5s test @ http://127.0.0.1:1234/api/v4/internal/authorized_keys?key=*** 12 threads and 200 connections Thread Stats Avg Stdev Max +/- Stdev Latency 1.44s 332.05ms 1.93s 78.69% Req/Sec 18.71 20.81 90.00 80.54% 394 requests in 5.01s, 211.97KB read Socket errors: connect 0, read 0, write 0, timeout 28 Requests/sec: 78.61 Transfer/sec: 42.29KB root@cat:~# cp /var/log/gitlab/gitlab-workhorse/current{,.test1} root@cat:~# cp /var/log/gitlab/gitlab-rails/api_json.log{,.test1} root@cat:~# cat /var/log/gitlab/gitlab-workhorse/current.test1 | sed '/context canceled/Q' | jq -s -r 'group_by(.status) | .[] | [.[0].status, length, .[0].error] | @tsv' 200 395
A subsequent strace also shows lots of
connect()
s EINPROGRESS:15495 15:09:06.817150 connect(97<TCP:[25496678]>, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...> 15495 15:09:06.817818 <... connect resumed> ) = -1 EINPROGRESS (Operation now in progress) <0.000125> root@cat:~# cat /tmp/workhorse_test1.strace | grep "connect.*EAGAIN" | wc -l 0 root@cat:~# cat /tmp/workhorse_test1.strace | grep "connect.*EINPROGRESS" | wc -l 428
-
128 somaxconn and Workhorse to Puma through UNIX sockets:
# in /etc/gitlab/gitlab.rb, comment the auth lines # gitlab_workhorse['auth_socket'] = nil # gitlab_workhorse['auth_backend'] = 'http://localhost:8080' gitlab-ctl reconfigure
Load-testing:
echo > /var/log/gitlab/gitlab-workhorse/current echo > /var/log/gitlab/gitlab-rails/api_json.log # results root@cat:~# wrk -t12 -c200 -d5s http://127.0.0.1:1234/api/v4/internal/authorized_keys?key=$KEY -H "Gitlab-Shared-Secret: $TOKEN" Running 5s test @ http://127.0.0.1:1234/api/v4/internal/authorized_keys?key=*** 12 threads and 200 connections Thread Stats Avg Stdev Max +/- Stdev Latency 125.09ms 273.39ms 2.00s 92.24% Req/Sec 95.98 83.77 520.00 73.78% 4636 requests in 5.01s, 690.39KB read Socket errors: connect 0, read 0, write 0, timeout 121 Non-2xx or 3xx responses: 4441 Requests/sec: 925.12 Transfer/sec: 137.77KB root@cat:~# cp /var/log/gitlab/gitlab-workhorse/current{,.test2} root@cat:~# cp /var/log/gitlab/gitlab-rails/api_json.log{,.test2} root@cat:~# cat /var/log/gitlab/gitlab-workhorse/current.test2 | sed '/context canceled/Q' | jq -s -r 'group_by(.status) | .[] | [.[0].status, length, .[0].error] | @tsv' 4494 badgateway: failed to receive response: dial unix /var/opt/gitlab/gitlab-rails/sockets/gitlab.socket: connect: resource temporarily unavailable 200 201 502 4494
Indeed, a (different) strace shows the
EAGAIN
s onconnect()
to the socket:20244 15:21:23.474462 connect(279<UNIX:[25517705]>, {sa_family=AF_UNIX, sun_path="/var/opt/gitlab/gitlab-rails/sockets/gitlab.socket"}, 53 <unfinished ...> 20244 15:21:23.474766 <... connect resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000037> root@cat:~# cat /tmp/workhorse_test2.strace | grep "connect.*EAGAIN" | wc -l 1033 root@cat:~# cat /tmp/workhorse_test2.strace | grep "connect.*EINPROGRESS" | wc -l 0
-
1024 somaxconn and Workhorse to Puma through UNIX sockets:
echo 1024 > /proc/sys/net/core/somaxconn gitlab-ctl restart
Load-testing:
echo > /var/log/gitlab/gitlab-workhorse/current echo > /var/log/gitlab/gitlab-rails/api_json.log root@cat:~# wrk -t12 -c200 -d5s http://127.0.0.1:1234/api/v4/internal/authorized_keys?key=$KEY -H "Gitlab-Shared-Secret: $TOKEN" Running 5s test @ http://127.0.0.1:1234/api/v4/internal/authorized_keys?key=**** 12 threads and 200 connections Thread Stats Avg Stdev Max +/- Stdev Latency 1.55s 513.90ms 2.00s 79.25% Req/Sec 24.93 29.02 111.00 82.93% 431 requests in 5.02s, 231.87KB read Socket errors: connect 0, read 0, write 0, timeout 166 Requests/sec: 85.93 Transfer/sec: 46.23KB root@cat:~# cp /var/log/gitlab/gitlab-workhorse/current{,.test3} root@cat:~# cp /var/log/gitlab/gitlab-rails/api_json.log{,.test3} root@cat:~# cat /var/log/gitlab/gitlab-workhorse/current.test3 | sed '/context canceled/Q' | jq -s -r 'group_by(.status) | .[] | [.[0].status, length, .[0].error] | @tsv' 200 433
All
connect()
s in an strace return 0 now, in this case, likely because I'm only doing 200 concurrent.
Worth noting that Linux kernels over 5.4 have somaxconn 4096 by default.
Also, we used to set somaxconn
1024 for Unicorn, stopped doing it for Puma, @stanhu opened an MR in !4688 (merged) for this.
Steps to reproduce
Described above.
Details of package version
gitlab-ee=13.4.3-ee.0
Environment details
- Operating System:
Ubuntu 18.04
- Installation Target, remove incorrect values:
- VM: GCP
- Installation Type, remove incorrect values:
- New Installation
- Is there any other software running on the machine:
no
- Is this a single or multiple node installation?
- Resources
- CPU:
2cpu
- Memory total:
8gb
- CPU:
Configuration details
Provide the relevant sections of `/etc/gitlab/gitlab.rb`
external_url 'https://***' gitlab_workhorse['listen_network'] = "tcp" gitlab_workhorse['listen_addr'] = 'localhost:1234' # gitlab_workhorse['auth_socket'] = nil # gitlab_workhorse['auth_backend'] = 'http://localhost:8080'