Fine-tune limits, requests, replicas, and Puma settings for Git/websockets traffic in Kubernetes

This issue is for discussing how we will set these values for both websocket and Git traffic when we move production workloads to the Kubernetes cluster.

Resource configuration

For workhorse we are setting the default:

        resources:
          requests:
            cpu: 100m
            memory: 100M

For puma we are setting:

      resources:
        limits:
          cpu: 1.5
          memory: 2G
        requests:
          cpu: 300m
          memory: 1.5G
      minReplicas: 2
      maxReplicas: 10
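As a rough sanity check on what these values mean for node capacity, the sketch below multiplies the puma requests and limits by minReplicas and maxReplicas. It assumes standard Kubernetes quantity suffixes (`m` for millicores, `G` for 10^9 bytes) and covers only the puma container, not workhorse or any sidecars. The helper names `parse_quantity` and `footprint` are made up for this illustration.

```python
from decimal import Decimal

# Kubernetes quantity suffixes (decimal and binary) handled by this sketch.
SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "m": Decimal("0.001"), "k": 10**3, "M": 10**6, "G": 10**9,
}

# Values copied from the puma resources block above.
PUMA = {
    "limits": {"cpu": "1.5", "memory": "2G"},
    "requests": {"cpu": "300m", "memory": "1.5G"},
}
MIN_REPLICAS, MAX_REPLICAS = 2, 10


def parse_quantity(text):
    """Convert a quantity such as '300m' or '1.5G' to a Decimal base value."""
    text = str(text)
    # Check two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * SUFFIXES[suffix]
    return Decimal(text)


def footprint(resources, replicas):
    """Total CPU cores and memory (decimal GB) for `replicas` puma containers."""
    totals = {}
    for kind, values in resources.items():
        cpu = parse_quantity(values["cpu"]) * replicas
        memory = parse_quantity(values["memory"]) * replicas
        totals[kind] = {
            "cpu_cores": float(cpu),
            "memory_gb": float(memory / 10**9),
        }
    return totals


if __name__ == "__main__":
    for replicas in (MIN_REPLICAS, MAX_REPLICAS):
        print(replicas, "replicas:", footprint(PUMA, replicas))
```

The requests totals are what the scheduler reserves; the limits totals are the ceiling the containers can burst to if the nodes have spare capacity.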

And for puma maxmemory/threads:

    puma:
      workerMaxMemory: 1342 # in MB units
      threads:
        min: 1
        max: 4
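To make the thread settings concrete, here is a small sketch that computes the total number of Puma workers and the upper bound on request threads for the layouts listed under Staging results. It assumes `threads.max: 4` applied to every run, including the VM, which the issue does not state explicitly. The `Layout` class is purely illustrative.

```python
from dataclasses import dataclass

MAX_THREADS = 4  # puma.threads.max from the config above (assumed for all runs)


@dataclass(frozen=True)
class Layout:
    name: str
    instances: int             # number of VMs or pods
    workers_per_instance: int  # Puma workers in each VM or pod

    def workers(self):
        return self.instances * self.workers_per_instance

    def max_threads(self, threads_per_worker=MAX_THREADS):
        # Upper bound on requests that can be processed concurrently.
        return self.workers() * threads_per_worker


LAYOUTS = [
    Layout("1 VM, 16 workers", 1, 16),
    Layout("8 pods x 2 workers", 8, 2),
    Layout("16 pods x 1 worker", 16, 1),
    Layout("4 pods x 4 workers", 4, 4),
]

if __name__ == "__main__":
    for layout in LAYOUTS:
        print(f"{layout.name:<20} workers={layout.workers():>2} "
              f"max_threads={layout.max_threads():>2}")
```

Under that assumption every layout has the same number of workers and threads, so differences between the runs would come from how the workers are packed into pods rather than from raw capacity.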

Request rates in production

Rails

Log query: https://log.gprd.gitlab.net/goto/338024a5e3564d4fc1c86c84dccd9e9b

Workhorse

Log query: https://log.gprd.gitlab.net/goto/bb7f4db5d80a08841f1b2e942e920908

Puma

  • We occasionally see queued connections to Puma, which indicates that at times no worker is available. This likely increases our 99th percentile latency
  • Other metrics to look at are the pool capacity and idle threads
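One way to look at these metrics together is to summarize the JSON that Puma's stats endpoint returns in cluster mode. The sketch below is only an illustration: the sample payload is invented (it is not production data), and the field names (`worker_status`, `last_status`, `backlog`, `pool_capacity`, `max_threads`) follow the Puma stats format as I understand it, so check them against the Puma version in use.

```python
import json

# Invented sample in the shape of Puma's cluster-mode stats (not real data).
SAMPLE_STATS = """
{
  "workers": 2,
  "booted_workers": 2,
  "worker_status": [
    {"index": 0, "last_status":
      {"backlog": 2, "running": 4, "pool_capacity": 0, "max_threads": 4}},
    {"index": 1, "last_status":
      {"backlog": 0, "running": 4, "pool_capacity": 3, "max_threads": 4}}
  ]
}
"""


def summarize(stats):
    """Aggregate queueing and capacity figures across all Puma workers."""
    workers = stats.get("worker_status", [])
    statuses = [w["last_status"] for w in workers]
    max_threads = sum(s.get("max_threads", 0) for s in statuses)
    pool_capacity = sum(s.get("pool_capacity", 0) for s in statuses)
    return {
        # Connections accepted but still waiting for a free thread.
        "backlog": sum(s.get("backlog", 0) for s in statuses),
        # Requests the workers could take right now.
        "pool_capacity": pool_capacity,
        "max_threads": max_threads,
        "busy_threads": max_threads - pool_capacity,
        # Workers that are queueing while having no spare capacity.
        "saturated_workers": [
            w["index"] for w in workers
            if w["last_status"].get("backlog", 0) > 0
            and w["last_status"].get("pool_capacity", 0) == 0
        ],
    }


if __name__ == "__main__":
    print(summarize(json.loads(SAMPLE_STATS)))
```

A non-zero backlog on a worker whose pool capacity is zero is the "no available worker" case described above, even when other workers still have idle threads.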

Staging results

1 VM, 16 workers

$ bombardier --header="Host: staging.gitlab.com" -l -d600s --http2 -r60  https://staging.gitlab.com/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack
Bombarding https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack for 10m0s using 125 connection(s)
[==============================================================================================================] 10m0s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec        60.00      24.57     299.18
  Latency      240.24ms    57.58ms      1.80s
  Latency Distribution
     50%   209.22ms
     75%   291.00ms
     90%   304.03ms
     95%   317.41ms
     99%   400.81ms
  HTTP codes:
    1xx - 0, 2xx - 35541, 3xx - 0, 4xx - 0, 5xx - 456
    others - 6
  Errors:
    Get https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack: http2: Transport: peer server initiated graceful shutdown after some of Request.Body was written; define Request.GetBody to avoid this error - 6
    stream error: stream ID 10655; INTERNAL_ERROR - 1
    stream error: stream ID 10651; INTERNAL_ERROR - 1
    stream error: stream ID 10653; INTERNAL_ERROR - 1
  Throughput:    13.84MB/s

8 pods x 2 workers per pod

2020-08-11: 14:25-14:35

$ bombardier --header="Host: staging.gitlab.com" -l -d600s --http2 -r60  https://staging.gitlab.com/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack
Bombarding https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack for 10m0s using 125 connection(s)
[==============================================================================================================] 10m0s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec        59.99      20.55     288.59
  Latency      195.31ms    31.07ms      1.40s
  Latency Distribution
     50%   187.89ms
     75%   196.82ms
     90%   213.87ms
     95%   246.52ms
     99%   322.87ms
  HTTP codes:
    1xx - 0, 2xx - 35997, 3xx - 0, 4xx - 0, 5xx - 0
    others - 6
  Errors:
    Get https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack: http2: Transport: peer server initiated graceful shutdown after some of Request.Body was written; define Request.GetBody to avoid this error - 6
  Throughput:    14.02MB/s

16 pods x 1 worker per pod

$ date -u; bombardier --header="Host: staging.gitlab.com" -l -d600s --http2 -r60  https://staging.gitlab.com/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack; date -u
Bombarding https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack for 10m0s using 125 connection(s)
[==============================================================================================================] 10m0s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec        59.99      23.20     282.96
  Latency      199.24ms    35.55ms      1.12s
  Latency Distribution
     50%   187.21ms
     75%   203.01ms
     90%   230.40ms
     95%   276.60ms
     99%   348.42ms
  HTTP codes:
    1xx - 0, 2xx - 35997, 3xx - 0, 4xx - 0, 5xx - 0
    others - 6
  Errors:
    Get https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack: http2: Transport: peer server initiated graceful shutdown after some of Request.Body was written; define Request.GetBody to avoid this error - 6
  Throughput:    14.02MB/s

[Screenshot: Screen_Shot_2020-08-13_at_4.10.42_PM]

4 pods x 4 workers per pod

$ date -u; bombardier --header="Host: staging.gitlab.com" -l -d600s --http2 -r60  https://staging.gitlab.com/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack; date -u
Thu Aug 13 14:57:50 UTC 2020
Bombarding https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack for 10m0s using 125 connection(s)
[==============================================================================================================] 10m0s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec        59.99      21.80     268.34
  Latency      203.34ms    47.21ms      1.33s
  Latency Distribution
     50%   188.76ms
     75%   204.16ms
     90%   239.92ms
     95%   292.96ms
     99%   403.89ms
  HTTP codes:
    1xx - 0, 2xx - 35997, 3xx - 0, 4xx - 0, 5xx - 0
    others - 6
  Errors:
    Get https://staging.gitlab.com:443/gitlab-org/gitlab-ee.git/info/refs?service=git-upload-pack: http2: Transport: peer server initiated graceful shutdown after some of Request.Body was written; define Request.GetBody to avoid this error - 6
  Throughput:    14.02MB/s
Thu Aug 13 15:07:51 UTC 2020

[Screenshot: Screen_Shot_2020-08-13_at_5.09.34_PM]
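To compare the four staging runs side by side, the sketch below tabulates the p50 and p99 latencies and the 5xx rate from the bombardier output above. The numbers are copied from those runs. The 6 "others" responses per run are left out of the rate, and the helper names are illustrative only.

```python
# (layout, p50_ms, p99_ms, count_2xx, count_5xx) copied from the runs above.
RUNS = [
    ("1 VM, 16 workers", 209.22, 400.81, 35541, 456),
    ("8 pods x 2 workers", 187.89, 322.87, 35997, 0),
    ("16 pods x 1 worker", 187.21, 348.42, 35997, 0),
    ("4 pods x 4 workers", 188.76, 403.89, 35997, 0),
]


def error_rate(count_2xx, count_5xx):
    """Fraction of completed HTTP responses that were 5xx."""
    total = count_2xx + count_5xx
    return count_5xx / total if total else 0.0


def summarize(runs):
    return [
        {
            "layout": name,
            "p50_ms": p50,
            "p99_ms": p99,
            "error_rate": error_rate(ok, err),
        }
        for name, p50, p99, ok, err in runs
    ]


def lowest_p99(runs):
    """Name of the layout with the lowest 99th percentile latency."""
    return min(summarize(runs), key=lambda row: row["p99_ms"])["layout"]


if __name__ == "__main__":
    for row in summarize(RUNS):
        print(f"{row['layout']:<20} p50={row['p50_ms']:7.2f}ms "
              f"p99={row['p99_ms']:7.2f}ms 5xx={row['error_rate']:.2%}")
    print("lowest p99:", lowest_p99(RUNS))
```

Each layout was tested with a single 10 minute run at 60 requests per second, so small differences between the pod layouts may be within run-to-run noise.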

Edited by John Jarvis