Fine-tune limits, requests, replicas, and Puma settings for git/websocket traffic in Kubernetes

This issue tracks how we will set resource limits, requests, replica counts, and Puma settings for both websocket and git traffic when we move production workloads to the Kubernetes cluster.

Resource configuration

For workhorse we are setting the default:

        resources:
          requests:
            cpu: 100m
            memory: 100M

For puma we are setting:

      resources:
        limits:
          cpu: 1.5
          memory: 2G
        requests:
          cpu: 300m
          memory: 1.5G
      minReplicas: 2
      maxReplicas: 10
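To make the gap between requests and limits concrete, here is a minimal Python sketch that converts the quantities above into plain numbers. It assumes standard Kubernetes quantity notation (`m` for milli, `M`/`G` for decimal mega/giga, `Mi`/`Gi` for binary); `parse_quantity` is a hypothetical helper written for this note, not part of any Kubernetes client.

```python
from decimal import Decimal

# Suffix multipliers for the subset of Kubernetes quantity notation used here.
_SUFFIXES = {
    "m": Decimal("0.001"),   # milli, e.g. 300m CPU = 0.3 cores
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,   # decimal megabytes
    "G": Decimal(10) ** 9,   # decimal gigabytes
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity such as '300m', '1.5', or '2G' to a plain number."""
    text = str(text)
    # Check two-character suffixes before one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


# Puma values copied from the settings above.
puma = {
    "limits": {"cpu": "1.5", "memory": "2G"},
    "requests": {"cpu": "300m", "memory": "1.5G"},
}

cpu_burst = parse_quantity(puma["limits"]["cpu"]) / parse_quantity(puma["requests"]["cpu"])
mem_headroom_mb = (
    parse_quantity(puma["limits"]["memory"]) - parse_quantity(puma["requests"]["memory"])
) / Decimal(10) ** 6

print(f"CPU limit is {cpu_burst}x the CPU request")
print(f"Memory headroom above the request: {mem_headroom_mb} MB")
```

With these settings a Puma container can burst to five times its requested CPU, while memory has only 500 MB of room between request and limit.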

And for puma maxmemory/threads:

    puma:
      workerMaxMemory: 1342 # in MB units
      threads:
        min: 1
        max: 4
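A rough check of how `workerMaxMemory` relates to the container memory limit, as a sketch only. It assumes `workerMaxMemory` applies to each Puma worker individually and that the 2G limit covers all workers in one container; it ignores the Puma master process and any other overhead. The worker counts are the per-pod values tested in staging later in this issue.

```python
WORKER_MAX_MEMORY_MB = 1342      # puma.workerMaxMemory from the settings above
MEMORY_LIMIT_MB = 2000           # limits.memory: 2G (decimal units)


def worst_case_worker_memory_mb(workers: int) -> int:
    """Upper bound on worker memory if every worker reaches workerMaxMemory."""
    return workers * WORKER_MAX_MEMORY_MB


for workers in (1, 2, 4):
    worst = worst_case_worker_memory_mb(workers)
    verdict = "fits within" if worst <= MEMORY_LIMIT_MB else "exceeds"
    print(f"{workers} worker(s): up to {worst} MB, {verdict} the {MEMORY_LIMIT_MB} MB limit")
```

Under these assumptions only a single worker per container is guaranteed to stay below the limit before the per-worker ceiling is reached, so the limit or `workerMaxMemory` would need to change with the worker count.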

Request rates in production

Rails

Log query: https://log.gprd.gitlab.net/goto/338024a5e3564d4fc1c86c84dccd9e9b

Currently, in production there are 16 custom-16-20486 VMs servicing git-ssh, git-https, and websocket traffic.
Each VM is configured for 16 Puma workers with up to 4 threads each.
Peak git HTTPS traffic occurs at ~12:00 UTC, when we see up to 2,400 requests/minute on a single VM, or 40 RPS for git requests to Rails, the majority of which are info refs: https://log.gprd.gitlab.net/goto/b25ce0227641f488fe0732c49caf478a . Divided across 16 workers, that means each worker is processing ~2.5 req/sec.
For latency: https://log.gprd.gitlab.net/goto/19a7d4d88654bd46c6af9b6c9c2813f4
- 99th percentile: ~0.75s
- 50th percentile: ~0.07s
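The per-worker arithmetic can be written out as a short Python sketch. The last step applies Little's law (requests in flight = arrival rate x time in system) and uses the median latency as a stand-in for the mean, which is an assumption and will understate the load when the latency distribution has a long tail.

```python
REQUESTS_PER_MINUTE_PER_VM = 2400   # peak git HTTPS requests on a single VM
WORKERS_PER_VM = 16                 # Puma workers per VM
THREADS_PER_WORKER = 4              # maximum Puma threads per worker
P50_LATENCY_S = 0.07                # median Rails latency from the log query

rps_per_vm = REQUESTS_PER_MINUTE_PER_VM / 60
rps_per_worker = rps_per_vm / WORKERS_PER_VM

# Little's law: average requests in flight = arrival rate x time in system.
in_flight_per_worker = rps_per_worker * P50_LATENCY_S

print(f"{rps_per_vm:.1f} RPS per VM, {rps_per_worker:.2f} RPS per worker")
print(
    f"~{in_flight_per_worker:.3f} requests in flight per worker "
    f"(of {THREADS_PER_WORKER} threads)"
)
```

This is an average at peak; it says nothing about bursts, which are what cause queueing.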

Workhorse

Log query: https://log.gprd.gitlab.net/goto/bb7f4db5d80a08841f1b2e942e920908

For workhorse latency: https://log.gprd.gitlab.net/goto/db0b77a82e5a91a9c98e78f361025239
- 99th percentile: ~1.5s
- 50th percentile: ~0.10s

Puma

We occasionally see queued connections to Puma, which indicates that at times no worker is available. This likely increases our 99th percentile latency.
Other metrics to look at are the pool capacity and idle threads.
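As an illustration of how queueing and pool capacity could be read together, the sketch below aggregates per-worker figures from a Puma stats payload. The field names (`backlog`, `running`, `pool_capacity`, `max_threads` under `worker_status[].last_status`) follow the shape of Puma's cluster-mode stats output as I understand it and should be treated as an assumption; the sample numbers are made up and are not production data.

```python
import json

# Hypothetical sample payload; values are invented for illustration.
SAMPLE_STATS = json.loads("""
{
  "workers": 2,
  "worker_status": [
    {"index": 0, "last_status": {"backlog": 0, "running": 4, "pool_capacity": 3, "max_threads": 4}},
    {"index": 1, "last_status": {"backlog": 2, "running": 4, "pool_capacity": 0, "max_threads": 4}}
  ]
}
""")


def summarize(stats: dict) -> dict:
    """Aggregate queueing and capacity figures across all workers."""
    statuses = [worker["last_status"] for worker in stats["worker_status"]]
    max_threads = sum(s["max_threads"] for s in statuses)
    capacity = sum(s["pool_capacity"] for s in statuses)
    return {
        "queued": sum(s["backlog"] for s in statuses),
        "pool_capacity": capacity,
        "busy_threads": max_threads - capacity,
        "saturated_workers": sum(1 for s in statuses if s["pool_capacity"] == 0),
    }


print(summarize(SAMPLE_STATS))
```

A non-zero `queued` value alongside a worker with zero pool capacity is the situation described above: a request arrived at a worker that had no free thread.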

Staging results

1 VM, 16 workers

$ bombardier --header="Host: staging.gitlab.com" -l -d600s --http2 -r60  https://staging.gitlab.com/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack
Bombarding https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack for 10m0s using 125 connection(s)
[==============================================================================================================] 10m0s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec        60.00      24.57     299.18
  Latency      240.24ms    57.58ms      1.80s
  Latency Distribution
     50%   209.22ms
     75%   291.00ms
     90%   304.03ms
     95%   317.41ms
     99%   400.81ms
  HTTP codes:
    1xx - 0, 2xx - 35541, 3xx - 0, 4xx - 0, 5xx - 456
    others - 6
  Errors:
    Get https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack: http2: Transport: peer server initiated graceful shutdown after some of Request.Body was written; define Request.GetBody to avoid this error - 6
    stream error: stream ID 10655; INTERNAL_ERROR - 1
    stream error: stream ID 10651; INTERNAL_ERROR - 1
    stream error: stream ID 10653; INTERNAL_ERROR - 1
  Throughput:    13.84MB/s

Workhorse latency: https://nonprod-log.gitlab.net/goto/6d4de876326cdba2f0f83f13b206eae0
- 99th percentile: 0.2s
- 50th percentile: 0.11s
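A quick sanity check of the run above, as a Python sketch: the rate and duration flags imply roughly 36,000 requests, and the HTTP code counts give the share of non-2xx responses. All numbers are copied from the bombardier output; nothing is measured independently.

```python
RATE_PER_S = 60          # bombardier -r60
DURATION_S = 600         # bombardier -d600s
codes = {"2xx": 35541, "5xx": 456, "others": 6}  # from the 1 VM run above

expected = RATE_PER_S * DURATION_S
completed = sum(codes.values())
error_rate = (codes["5xx"] + codes["others"]) / completed

print(f"expected ~{expected} requests, recorded {completed}")
print(f"non-2xx share: {error_rate:.2%}")
```

The 1 VM run therefore returned errors on roughly 1.3% of requests, which is worth keeping in mind when comparing its latency percentiles with the pod runs that reported no 5xx responses.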

8 pods x 2 workers per pod

2020-08-11: 14:25-14:35

$ bombardier --header="Host: staging.gitlab.com" -l -d600s --http2 -r60  https://staging.gitlab.com/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack
Bombarding https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack for 10m0s using 125 connection(s)
[==============================================================================================================] 10m0s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec        59.99      20.55     288.59
  Latency      195.31ms    31.07ms      1.40s
  Latency Distribution
     50%   187.89ms
     75%   196.82ms
     90%   213.87ms
     95%   246.52ms
     99%   322.87ms
  HTTP codes:
    1xx - 0, 2xx - 35997, 3xx - 0, 4xx - 0, 5xx - 0
    others - 6
  Errors:
    Get https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack: http2: Transport: peer server initiated graceful shutdown after some of Request.Body was written; define Request.GetBody to avoid this error - 6
  Throughput:    14.02MB/s

Workhorse latency: https://nonprod-log.gitlab.net/goto/be18b2d88309cfdbe5fda988d8f1c717
- 99th percentile: ~0.2s
- 50th percentile: ~0.09s

CPU utilization
- 0.3 - 0.4 cores per container

Memory
- 1.5 - 2 GB

16 pods x 1 worker per pod

$ date -u; bombardier --header="Host: staging.gitlab.com" -l -d600s --http2 -r60  https://staging.gitlab.com/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack; date -u
Bombarding https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack for 10m0s using 125 connection(s)
[==============================================================================================================] 10m0s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec        59.99      23.20     282.96
  Latency      199.24ms    35.55ms      1.12s
  Latency Distribution
     50%   187.21ms
     75%   203.01ms
     90%   230.40ms
     95%   276.60ms
     99%   348.42ms
  HTTP codes:
    1xx - 0, 2xx - 35997, 3xx - 0, 4xx - 0, 5xx - 0
    others - 6
  Errors:
    Get https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack: http2: Transport: peer server initiated graceful shutdown after some of Request.Body was written; define Request.GetBody to avoid this error - 6
  Throughput:    14.02MB/s

Workhorse latency: https://nonprod-log.gitlab.net/goto/c15ab8943f060c53ed02bd75380dc0b5
- 99th percentile: ~0.2s
- 50th percentile: ~0.1s

CPU utilization
- ~0.2 cores per container

Memory
- ~1.5 GB

4 pods x 4 workers per pod

$ date -u; bombardier --header="Host: staging.gitlab.com" -l -d600s --http2 -r60  https://staging.gitlab.com/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack; date -u
Thu Aug 13 14:57:50 UTC 2020
Bombarding https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack for 10m0s using 125 connection(s)
[==============================================================================================================] 10m0s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec        59.99      21.80     268.34
  Latency      203.34ms    47.21ms      1.33s
  Latency Distribution
     50%   188.76ms
     75%   204.16ms
     90%   239.92ms
     95%   292.96ms
     99%   403.89ms
  HTTP codes:
    1xx - 0, 2xx - 35997, 3xx - 0, 4xx - 0, 5xx - 0
    others - 6
  Errors:
    Get https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack: http2: Transport: peer server initiated graceful shutdown after some of Request.Body was written; define Request.GetBody to avoid this error - 6
  Throughput:    14.02MB/s
Thu Aug 13 15:07:51 UTC 2020

Workhorse latency: https://nonprod-log.gitlab.net/goto/8cbefbb1f168d435f3ceba057cf3ba9e
- 99th percentile: ~0.2s
- 50th percentile: ~0.1s

CPU utilization
- ~0.8 cores per container

Memory
- ~2.5 GB
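To compare the three pod layouts side by side, the Python sketch below multiplies the per-container figures by the pod count. It uses the midpoint where a range was reported (0.35 cores and 1.75 GB for the 8 x 2 layout) and reads every memory figure as GB per container; both choices are assumptions made for this comparison.

```python
# (pods, workers per pod, approx. cores per container, approx. GB per container)
layouts = [
    (8, 2, 0.35, 1.75),
    (16, 1, 0.2, 1.5),
    (4, 4, 0.8, 2.5),
]


def totals(pods: int, workers: int, cpu: float, mem_gb: float) -> dict:
    """Fleet-wide totals for one layout, plus memory per Puma worker."""
    return {
        "workers": pods * workers,
        "cores": pods * cpu,
        "memory_gb": pods * mem_gb,
        "gb_per_worker": mem_gb / workers,
    }


for layout in layouts:
    t = totals(*layout)
    print(
        f"{layout[0]:>2} pods x {layout[1]} workers: {t['workers']} workers, "
        f"~{t['cores']:.1f} cores, ~{t['memory_gb']:.0f} GB total, "
        f"~{t['gb_per_worker']:.2f} GB per worker"
    )
```

Every layout runs 16 workers in total and total CPU is similar across them, so under these assumptions the main difference is memory: fewer, larger pods use noticeably less memory per worker.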

Edited Aug 13, 2020 by John Jarvis