Skip to content

Puma UNIX socket saturation through Workhorse on new installs

Summary

Multiple concurrent requests to Puma through Workhorse on a default (new) 13.4 install get quickly saturated and Workhorse starts throwing "Resource temporarily unavailable" errors when connecting to the Puma socket.

Large Premium Customer (ZD - internal only) reported getting 502 errors for internal/authorized_keys endpoint after a 13.4.2 upgrade, Workhorse logs show the "Resource temporarily unavailable" errors, this coincided with !4498 (merged) that changed the gitlab-shell communication to go through Workhorse.

A Puma issue seems related intially, that resulted in Puma now defaulting to a backlog of 1024, but there are also reports that it's still limited by the 128 somaxconn default.

Few local tests on a new 13.4 Ubuntu 18.04 instance:

  1. Prerequisite:

    TOKEN=$(cat /opt/gitlab/embedded/service/gitlab-rails/.gitlab_shell_secret | base64 -w0)
    KEY=$(cat /var/opt/gitlab/.ssh/authorized_keys | tail -1 | awk '{print $4}')
    
    echo "gitlab_workhorse['listen_network'] = 'tcp'" >> /etc/gitlab/gitlab.rb
    echo "gitlab_workhorse['listen_addr'] = 'localhost:1234'" >> /etc/gitlab/gitlab.rb
    
    gitlab-ctl reconfigure
  2. 128 somaxconn and Workhorse to Puma through TCP:

    echo 128 > /proc/sys/net/core/somaxconn
    
    # in /etc/gitlab/gitlab.rb
    gitlab_workhorse['auth_socket'] = nil
    gitlab_workhorse['auth_backend'] = 'http://localhost:8080'
    
    gitlab-ctl reconfigure
    gitlab-ctl restart # for the somaxconn change

    Load-testing:

    echo > /var/log/gitlab/gitlab-workhorse/current
    echo > /var/log/gitlab/gitlab-rails/api_json.log
    
    # results
    
    root@cat:~# wrk -t12 -c200 -d5s http://127.0.0.1:1234/api/v4/internal/authorized_keys?key=$KEY -H "Gitlab-Shared-Secret: $TOKEN"
    Running 5s test @ http://127.0.0.1:1234/api/v4/internal/authorized_keys?key=***
      12 threads and 200 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     1.44s   332.05ms   1.93s    78.69%
        Req/Sec    18.71     20.81    90.00     80.54%
      394 requests in 5.01s, 211.97KB read
      Socket errors: connect 0, read 0, write 0, timeout 28
    Requests/sec:     78.61
    Transfer/sec:     42.29KB
    
    root@cat:~# cp /var/log/gitlab/gitlab-workhorse/current{,.test1}
    root@cat:~# cp /var/log/gitlab/gitlab-rails/api_json.log{,.test1}
    
    root@cat:~# cat /var/log/gitlab/gitlab-workhorse/current.test1 | sed '/context canceled/Q' | jq -s -r 'group_by(.status) | .[] | [.[0].status, length, .[0].error] | @tsv'
    200     395

    A subsequent strace also shows lots of connect()s EINPROGRESS:

    15495 15:09:06.817150 connect(97<TCP:[25496678]>, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...>
    15495 15:09:06.817818 <... connect resumed> ) = -1 EINPROGRESS (Operation now in progress) <0.000125>
    
    root@cat:~# cat /tmp/workhorse_test1.strace  | grep "connect.*EAGAIN" | wc -l
    0
    root@cat:~# cat /tmp/workhorse_test1.strace  | grep "connect.*EINPROGRESS" | wc -l
    428
  3. 128 somaxconn and Workhorse to Puma through UNIX sockets:

    # in /etc/gitlab/gitlab.rb, comment the auth lines
    # gitlab_workhorse['auth_socket'] = nil
    # gitlab_workhorse['auth_backend'] = 'http://localhost:8080'
    
    gitlab-ctl reconfigure

    Load-testing:

    echo > /var/log/gitlab/gitlab-workhorse/current
    echo > /var/log/gitlab/gitlab-rails/api_json.log
    
    # results 
    
    root@cat:~# wrk -t12 -c200 -d5s http://127.0.0.1:1234/api/v4/internal/authorized_keys?key=$KEY -H "Gitlab-Shared-Secret: $TOKEN"
    Running 5s test @ http://127.0.0.1:1234/api/v4/internal/authorized_keys?key=***
      12 threads and 200 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   125.09ms  273.39ms   2.00s    92.24%
        Req/Sec    95.98     83.77   520.00     73.78%
      4636 requests in 5.01s, 690.39KB read
      Socket errors: connect 0, read 0, write 0, timeout 121
      Non-2xx or 3xx responses: 4441
    Requests/sec:    925.12
    Transfer/sec:    137.77KB
    
    root@cat:~# cp /var/log/gitlab/gitlab-workhorse/current{,.test2}
    root@cat:~# cp /var/log/gitlab/gitlab-rails/api_json.log{,.test2}
    
    root@cat:~# cat /var/log/gitlab/gitlab-workhorse/current.test2 | sed '/context canceled/Q' | jq -s -r 'group_by(.status) | .[] | [.[0].status, length, .[0].error] | @tsv'
            4494    badgateway: failed to receive response: dial unix /var/opt/gitlab/gitlab-rails/sockets/gitlab.socket: connect: resource temporarily unavailable
    200     201
    502     4494

    Indeed, a (different) strace shows the EAGAINs on connect() to the socket:

    20244 15:21:23.474462 connect(279<UNIX:[25517705]>, {sa_family=AF_UNIX, sun_path="/var/opt/gitlab/gitlab-rails/sockets/gitlab.socket"}, 53 <unfinished ...>
    20244 15:21:23.474766 <... connect resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
    
    root@cat:~# cat /tmp/workhorse_test2.strace  | grep "connect.*EAGAIN" | wc -l
    1033
    root@cat:~# cat /tmp/workhorse_test2.strace  | grep "connect.*EINPROGRESS" | wc -l
    0
  4. 1024 somaxconn and Workhorse to Puma through UNIX sockets:

    echo 1024 > /proc/sys/net/core/somaxconn
    
    gitlab-ctl restart

    Load-testing:

    echo > /var/log/gitlab/gitlab-workhorse/current
    echo > /var/log/gitlab/gitlab-rails/api_json.log
    
    root@cat:~# wrk -t12 -c200 -d5s http://127.0.0.1:1234/api/v4/internal/authorized_keys?key=$KEY -H "Gitlab-Shared-Secret: $TOKEN"
    Running 5s test @ http://127.0.0.1:1234/api/v4/internal/authorized_keys?key=****
      12 threads and 200 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     1.55s   513.90ms   2.00s    79.25%
        Req/Sec    24.93     29.02   111.00     82.93%
      431 requests in 5.02s, 231.87KB read
      Socket errors: connect 0, read 0, write 0, timeout 166
    Requests/sec:     85.93
    Transfer/sec:     46.23KB
    
    root@cat:~# cp /var/log/gitlab/gitlab-workhorse/current{,.test3}
    root@cat:~# cp /var/log/gitlab/gitlab-rails/api_json.log{,.test3}
    
    root@cat:~# cat /var/log/gitlab/gitlab-workhorse/current.test3 | sed '/context canceled/Q' | jq -s -r 'group_by(.status) | .[] | [.[0].status, length, .[0].error] | @tsv'
    200     433

    All connect()s in an strace return 0 now, in this case, likely because I'm only doing 200 concurrent.

Worth noting that Linux kernels over 5.4 have somaxconn 4096 by default.

Also, we used to set somaxconn 1024 for Unicorn, stopped doing it for Puma, @stanhu opened an MR in !4688 (merged) for this.

Steps to reproduce

Described above.

Details of package version

gitlab-ee=13.4.3-ee.0

Environment details

  • Operating System: Ubuntu 18.04
  • Installation Target, remove incorrect values:
    • VM: GCP
  • Installation Type, remove incorrect values:
    • New Installation
  • Is there any other software running on the machine: no
  • Is this a single or multiple node installation?
  • Resources
    • CPU: 2cpu
    • Memory total: 8gb

Configuration details

Provide the relevant sections of `/etc/gitlab/gitlab.rb`
external_url 'https://***'
gitlab_workhorse['listen_network'] = "tcp"
gitlab_workhorse['listen_addr'] = 'localhost:1234'
# gitlab_workhorse['auth_socket'] = nil
# gitlab_workhorse['auth_backend'] = 'http://localhost:8080'