
Run GC after each metrics request

Matthias Käppler requested to merge mk-gc-metrics-requests into master

What does this MR do and why?

NOTE: Marked as depending on !77114 (merged), since I would prefer to have request metrics before we make changes that affect performance and memory use.

We recently spun out a separate server process to serve Sidekiq metrics into Prometheus: &6409 (closed)

This came with additional memory use. We already started to optimize this away by forking from a common parent process, which resulted in some decent savings, but overall memory use is still high: #344629 (comment 790913116)

In this MR, we trade off some CPU for a large improvement in memory use: by running a major GC cycle after each /metrics request, we reclaim memory before there is memory pressure. The Ruby VM only triggers a major collection when it risks running out of free object slots by passing configurable thresholds (which are difficult to tune for this server). I believe it is the lack of memory pressure in production that leads to high memory use, since we do not GC frequently enough and let the heap run unnecessarily full.
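The idea can be sketched as a Rack-style middleware (a hypothetical illustration, not the actual implementation in this MR) that forces a major collection once the response has been produced:

```ruby
# Hypothetical sketch: a Rack-style middleware that forces a major GC
# cycle after each /metrics response has been built. Class and naming are
# illustrative, not the actual GitLab code.
class GcAfterRequestMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  ensure
    # full_mark: true requests a major (full-heap) collection;
    # immediate_sweep: true sweeps dead object slots right away instead
    # of lazily, so memory is reclaimed before the next scrape arrives.
    GC.start(full_mark: true, immediate_sweep: true)
  end
end
```

Because `GC.start` runs in the `ensure` block, it fires even if rendering the metrics payload raises, and the response value of the inner app is passed through unchanged.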

Since this is a CPU-bound operation, it makes requests slightly slower. My measurements suggest that in the best case, we can reclaim as much as 55% of memory this way, while paying with anywhere between 50-100ms of extra latency at the 99th percentile. The Prometheus scraper only visits this endpoint every 10 seconds or so, so it is very low volume. I don't think we need to be concerned about spending a little extra time on this, and end users are not affected either.

Note that this is only enabled when serving metrics outside of a worker process, which we do for Sidekiq currently. For Puma and its in-process metrics server, this is disabled, as measurements showed that the added latency is significant.
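The gating described above might look roughly like this (the `gc_requests` keyword is an illustrative name I'm using for the sketch, not the actual setting in the codebase):

```ruby
# Hypothetical sketch: per-request GC is enabled only for the standalone
# (out-of-process) exporter, e.g. the Sidekiq metrics server, and disabled
# for Puma's in-process metrics server. Names are illustrative only.
class MetricsHandler
  def initialize(gc_requests: false)
    @gc_requests = gc_requests
  end

  def handle(_request)
    render_metrics
  ensure
    GC.start(full_mark: true, immediate_sweep: true) if @gc_requests
  end

  private

  def render_metrics
    "# placeholder for the rendered Prometheus text payload"
  end
end
```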

Screenshots or screen recordings

Before

Memory use

NOTE: Memory use was pretty unstable: since we were not GC'ing after each request, it ranged anywhere from 70MB to 130MB of RSS. I recorded the worst case I saw across 3 ab runs, since in production we will mostly care about worst cases.

git@bafc7be697e2:~/gitlab$ smem -P sidekiq_exporter
  PID User     Command                         Swap      USS      PSS      RSS 
  187 git      /usr/bin/python /usr/bin/sm        0     8380     8682    12048 
   73 git      sidekiq_exporter                   0    90636   106586   127392

Latencies

git@bafc7be697e2:~/gitlab$ ab -n 100 -c 1 -l localhost:3807/metrics
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done


Server Software:        WEBrick/1.6.1
Server Hostname:        localhost
Server Port:            3807

Document Path:          /metrics
Document Length:        Variable

Concurrency Level:      1
Time taken for tests:   9.882 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      341562772 bytes
HTML transferred:       341541672 bytes
Requests per second:    10.12 [#/sec] (mean)
Time per request:       98.821 [ms] (mean)
Time per request:       98.821 [ms] (mean, across all concurrent requests)
Transfer rate:          33753.75 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:    95   99   3.4     98     125
Waiting:       94   98   3.4     97     124
Total:         95   99   3.4     98     125

Percentage of the requests served within a certain time (ms)
  50%     98
  66%     99
  75%     99
  80%    100
  90%    101
  95%    104
  98%    106
  99%    125
 100%    125 (longest request)

After

Memory use

NOTE: As above, I recorded the worst case in 3 runs, but results were very stable, hovering around these same values:

git@ed7d9cbb080f:~/gitlab$ smem -P sidekiq_exporter
  PID User     Command                         Swap      USS      PSS      RSS
  205 git      /usr/bin/python /usr/bin/sm        0     8296     8579    12004
   72 git      sidekiq_exporter                   0    17172    34064    55836

Latencies

git@ed7d9cbb080f:~/gitlab$ ab -n 100 -c 1 -l localhost:3807/metrics
This is ApacheBench, Version 2.3
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done


Server Software:        WEBrick/1.6.1
Server Hostname:        localhost
Server Port:            3807

Document Path:          /metrics
Document Length:        Variable

Concurrency Level:      1
Time taken for tests:   13.369 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      347980788 bytes
HTML transferred:       347959688 bytes
Requests per second:    7.48 [#/sec] (mean)
Time per request:       133.695 [ms] (mean)
Time per request:       133.695 [ms] (mean, across all concurrent requests)
Transfer rate:          25417.94 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   127  134  11.6    132     234
Waiting:      125  131   5.6    130     175
Total:        127  134  11.6    132     234

Percentage of the requests served within a certain time (ms)
  50%    132
  66%    133
  75%    133
  80%    134
  90%    139
  95%    142
  98%    177
  99%    234
 100%    234 (longest request)

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Matthias Käppler
