Run GC after each metrics request
## What does this MR do and why?
NOTE: Marked as depending on !77114 (merged), since I would prefer to have request metrics before we make changes that affect performance and memory use.
We recently spun out a separate server process to serve Sidekiq metrics to Prometheus: &6409 (closed)
This came with additional memory use. We already started to optimize this away by forking from a common parent process, which resulted in some decent savings, but overall memory use is still high: #344629 (comment 790913116)
In this MR, we trade some CPU for a large improvement in memory use: by running a major GC cycle after each `/metrics` request, we reclaim memory before memory pressure builds up. The Ruby VM only triggers a major collection when it risks running out of free object slots, i.e. when it crosses configurable thresholds (which are difficult to tune for this server). I believe it is this lack of memory pressure in production that leads to high memory use: we do not GC frequently enough and let the heap run unnecessarily full.
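For reference, CRuby exposes the counters and thresholds that gate major collections through `GC.stat`, which makes it easy to see how far a mostly idle process sits from the next major GC (key names are CRuby-specific and may vary between Ruby versions):

```ruby
# Inspect the CRuby heap counters that gate major GC runs.
# These are the "configurable thresholds" referred to above.
stats = GC.stat

puts "major GC runs so far:       #{stats[:major_gc_count]}"
puts "old objects:                #{stats[:old_objects]}"
puts "old-object limit (trigger): #{stats[:old_objects_limit]}"
puts "old-gen malloc limit:       #{stats[:oldmalloc_increase_bytes_limit]}"
```

Until `old_objects` or old-generation malloc growth crosses its limit, the VM happily lets the heap grow, which matches the RSS behaviour observed below.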
Since GC is a CPU-bound operation, it makes requests slightly slower. My measurements suggest that in the best case we can reclaim as much as 55% of memory this way, while paying anywhere between 50-100ms of extra latency at the 99th percentile. The Prometheus scraper only visits this endpoint every 10 seconds or so, so it is very low volume, and I don't think we need to be concerned about the extra time spent per request. End users are not affected by this either.
Note that this is only enabled when serving metrics outside of a worker process, which we do for Sidekiq currently. For Puma and its in-process metrics server, this is disabled, as measurements showed that the added latency is significant.
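The approach can be sketched as a small Rack-style middleware (the class name and the `enabled:` flag are hypothetical illustrations, not the actual code in this MR): serve the request, then force a full major collection, with a switch so it stays off for Puma's in-process server.

```ruby
# Sketch of the idea only, not the MR's actual implementation:
# after each /metrics response is built, force a full major GC cycle
# so freed objects are reclaimed before memory pressure builds up.
class GcAfterRequest
  def initialize(app, enabled: true)
    @app = app
    @enabled = enabled # e.g. false for Puma's in-process metrics server
  end

  def call(env)
    @app.call(env)
  ensure
    # full_mark: run a major (not just minor) collection;
    # immediate_sweep: sweep now instead of lazily.
    GC.start(full_mark: true, immediate_sweep: true) if @enabled
  end
end
```

Running the collection in `ensure` keeps it out of the response path's happy-path logic while still charging the cost to the metrics request rather than to any user-facing work.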
## Screenshots or screen recordings

### Before

#### Memory use

NOTE: Memory use was pretty unstable: because we were not GC'ing after each request, it would range anywhere from 70MB to 130MB of RSS. I recorded the worst case I saw in 3 `ab` runs, since in production we will mostly care about worst cases.

```
git@bafc7be697e2:~/gitlab$ smem -P sidekiq_exporter
  PID User     Command                         Swap      USS      PSS      RSS
  187 git      /usr/bin/python /usr/bin/sm        0     8380     8682    12048
   73 git      sidekiq_exporter                   0    90636   106586   127392
```
#### Latencies

```
git@bafc7be697e2:~/gitlab$ ab -n 100 -c 1 -l localhost:3807/metrics
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done


Server Software:        WEBrick/1.6.1
Server Hostname:        localhost
Server Port:            3807

Document Path:          /metrics
Document Length:        Variable

Concurrency Level:      1
Time taken for tests:   9.882 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      341562772 bytes
HTML transferred:       341541672 bytes
Requests per second:    10.12 [#/sec] (mean)
Time per request:       98.821 [ms] (mean)
Time per request:       98.821 [ms] (mean, across all concurrent requests)
Transfer rate:          33753.75 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:    95   99   3.4     98     125
Waiting:       94   98   3.4     97     124
Total:         95   99   3.4     98     125

Percentage of the requests served within a certain time (ms)
  50%     98
  66%     99
  75%     99
  80%    100
  90%    101
  95%    104
  98%    106
  99%    125
 100%    125 (longest request)
```
### After

#### Memory use

NOTE: As above, I recorded the worst case in 3 runs, but results were very stable, hovering around these same values:

```
git@ed7d9cbb080f:~/gitlab$ smem -P sidekiq_exporter
  PID User     Command                         Swap      USS      PSS      RSS
  205 git      /usr/bin/python /usr/bin/sm        0     8296     8579    12004
   72 git      sidekiq_exporter                   0    17172    34064    55836
```
#### Latencies

```
git@ed7d9cbb080f:~/gitlab$ ab -n 100 -c 1 -l localhost:3807/metrics
This is ApacheBench, Version 2.3
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done


Server Software:        WEBrick/1.6.1
Server Hostname:        localhost
Server Port:            3807

Document Path:          /metrics
Document Length:        Variable

Concurrency Level:      1
Time taken for tests:   13.369 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      347980788 bytes
HTML transferred:       347959688 bytes
Requests per second:    7.48 [#/sec] (mean)
Time per request:       133.695 [ms] (mean)
Time per request:       133.695 [ms] (mean, across all concurrent requests)
Transfer rate:          25417.94 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   127  134  11.6    132     234
Waiting:      125  131   5.6    130     175
Total:        127  134  11.6    132     234

Percentage of the requests served within a certain time (ms)
  50%    132
  66%    133
  75%    133
  80%    134
  90%    139
  95%    142
  98%    177
  99%    234
 100%    234 (longest request)
```
## How to set up and validate locally

1. Run GitLab with the standalone Sidekiq metrics exporter enabled, listening on `localhost:3807` as in the measurements above.
2. Generate load against the endpoint, e.g. `ab -n 100 -c 1 -l localhost:3807/metrics`.
3. Check memory use of the exporter process with `smem -P sidekiq_exporter`; RSS should now stay low and stable across runs instead of growing toward the 70-130MB range seen before.
## MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

- [x] I have evaluated the MR acceptance checklist for this MR.