Some puma processes consuming lots of memory
Summary of findings
- Problem: What occasionally causes a Puma worker's memory usage to bloat so enormously that the kernel must OOM-kill it?
- Answer: An exception handler (Sentry) was trying to generate an 80 GB string.
- The exception handler was trying to compose a string showing the state of the objects it thought were relevant.
- In doing so, it recursively descended into the object hierarchy of the in-memory Prometheus metrics, specifically a set of thousands of `Prometheus::Client::MmapedValue` objects.
- Each of those `MmapedValue` objects points to a shared data structure (`MmapedDict`, which wraps `MmapedFile`) containing all stored metrics. So the string representation of each `MmapedValue` object included a copy of the entire 16 MB `MmapedFile` object, as represented by the `@positions` attribute of the singleton `MmapedDict` object.
- We don't know what exception was being thrown (because the string never finished being generated), but @stanhu has merged a fix to `MmapedDict` (https://gitlab.com/gitlab-org/prometheus-client-mmap/-/merge_requests/51/diffs) that makes its `inspect` method return just its object id rather than its complete contents. That should drastically reduce the size of the exception message, so the next time this happens, the message should be small enough that Sentry actually gets to log the exception rather than starve the process of memory.
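The failure mode above can be reproduced in miniature. This is a sketch with made-up classes (`FakeDict`, `FakeValue` stand in for `MmapedDict` and `MmapedValue`; this is not the actual prometheus-client-mmap code): Ruby's default `Object#inspect` embeds the `inspect` of every instance variable, so N wrappers pointing at one shared dict yield N copies of the dict's entire payload. Overriding `inspect` on the shared object, as the merged fix does, collapses that to a constant-size identity string.

```ruby
# Stand-in for MmapedDict: holds a large shared payload in @positions.
class FakeDict
  def initialize(size)
    @positions = "x" * size # stands in for the 16 MB mmapped file contents
  end
end

# Stand-in for MmapedValue: each instance references the shared dict.
class FakeValue
  def initialize(dict)
    @dict = dict
  end
end

dict   = FakeDict.new(1_000)
values = Array.new(100) { FakeValue.new(dict) }

# Default inspect recursively includes @dict, hence @positions, per value,
# so the combined output is at least 100 * 1000 bytes for a 1000-byte payload.
total = values.sum { |v| v.inspect.bytesize }
puts total

# The shape of the fix: make the shared object's inspect return only
# its identity instead of its contents.
class FakeDict
  def inspect
    "#<FakeDict:0x%016x>" % (object_id << 1)
  end
end

# Now each value's inspect is a few dozen bytes, regardless of payload size.
fixed_total = values.sum { |v| v.inspect.bytesize }
puts fixed_total
```

Scale the payload up to 16 MB and the wrapper count into the thousands, and the default behavior produces tens of gigabytes, which matches the 80 GB string observed here.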
Original details when we started this investigation
@andrewn posted this:
@sre-oncall I stumbled across this unusual memory usage pattern on web-18. https://dashboards.gitlab.net/d/web-main/web-overview?viewPanel=63&orgId=1
7 day view:
`ps` shows pid 14090 (comm `puma: cluster worker 5`) using lots of memory (and CPU):
```
git 10709 31.4 1.4 5083860 954136 ? Ssl Jul17 1249:20 | \_ puma 4.3.3.gitlab.2 (unix:///var/opt/gitlab/gitlab-rails/sockets/gitlab.socket,tcp://0.0.0.0:8080) [gi
git 14022 32.3 3.1 6589264 2047336 ? Sl 04:03 186:01 | \_ puma: cluster worker 0: 10709 [gitlab-puma-worker]
git 14024 32.2 3.7 6891540 2446168 ? Sl 04:03 185:28 | \_ puma: cluster worker 1: 10709 [gitlab-puma-worker]
git 14086 33.5 2.9 6368668 1974008 ? Sl 04:03 192:49 | \_ puma: cluster worker 3: 10709 [gitlab-puma-worker]
git 14088 32.6 2.8 6455624 1885604 ? Sl 04:03 187:46 | \_ puma: cluster worker 4: 10709 [gitlab-puma-worker]
git 14090 37.8 55.4 91223712 36604016 ? Sl 04:03 217:59 | \_ puma: cluster worker 5: 10709 [gitlab-puma-worker]
git 14092 33.9 2.7 6386488 1815192 ? Sl 04:03 195:27 | \_ puma: cluster worker 6: 10709 [gitlab-puma-worker]
git 14213 34.2 2.5 6338872 1702756 ? Sl 04:03 197:11 | \_ puma: cluster worker 8: 10709 [gitlab-puma-worker]
git 32623 41.2 2.0 6224160 1382136 ? Sl 07:44 146:17 | \_ puma: cluster worker 7: 10709 [gitlab-puma-worker]
git 1284 39.2 2.0 6054716 1352964 ? Sl 08:40 117:10 | \_ puma: cluster worker 14: 10709 [gitlab-puma-worker]
git 2268 41.9 2.1 6548104 1420500 ? Sl 08:43 124:08 | \_ puma: cluster worker 13: 10709 [gitlab-puma-worker]
git 3234 40.0 2.1 6123304 1405672 ? Sl 08:45 117:30 | \_ puma: cluster worker 15: 10709 [gitlab-puma-worker]
git 6609 39.7 2.0 6669448 1361928 ? Sl 08:54 113:07 | \_ puma: cluster worker 12: 10709 [gitlab-puma-worker]
git 12988 38.8 2.1 6183048 1445100 ? Sl 09:10 104:35 | \_ puma: cluster worker 9: 10709 [gitlab-puma-worker]
git 25519 39.9 2.1 6144648 1444748 ? Sl 09:27 100:43 | \_ puma: cluster worker 2: 10709 [gitlab-puma-worker]
git 18468 38.7 2.0 6036620 1379912 ? Sl 10:17 78:05 | \_ puma: cluster worker 11: 10709 [gitlab-puma-worker]
git 24371 49.1 1.8 5467272 1223292 ? Sl 13:14 12:01 | \_ puma: cluster worker 10: 10709 [gitlab-puma-worker]
```
`pstree`:
```
$ sudo pstree -p 14090
bundle(14090)─┬─{cluster.rb:248}(14091)
              ├─{cluster.rb:284}(14150)
              ├─{connection_poo*}(14148)
              ├─{daemon.rb:42}(14147)
              ├─{default-executo}(14186)
              ├─{emitters.rb:238}(24311)
              ├─{grpc_global_tim}(14188)
              ├─{grpc_global_tim}(14201)
              ├─{grpc_global_tim}(18417)
              ├─{puma server}(14154)
              ├─{resolver-execut}(14187)
              ├─{ruby_sampler}(14146)
              ├─{scheduler.rb:73}(11667)
              ├─{service_discov*}(14149)
              ├─{thread_pool.rb*}(14152)
              ├─{thread_pool.rb*}(14153)
              ├─{thread_pool.rb*}(14189)
              ├─{thread_pool.rb*}(14190)
              ├─{thread_pool.rb*}(7786)
              └─{thread_pool.rb*}(11665)
```
rbspy for that process doesn't show much, though lots of stacks appear to be doing GC-related work.
And here is a flamegraph from `perf` (`sudo perf record -p 14090 -F97 -g -- sleep 30`):
Since this appears to be happening only on some processes (and only some hosts?), it may be good to grab some sort of heap dump to see where the memory is being allocated.
Stackprof (gitlab-org/gitlab#225473 (closed)) may also help provide some insights; it has a memory mode as well. With that approach, though, we'd need to catch the process in the act.
One possible approach for heap dumping:
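A heap dump can be taken from inside a Ruby process with the stdlib `objspace` extension. This is only a sketch of one common technique, not necessarily the approach used in this investigation; the output path is illustrative.

```ruby
require "objspace"
require "tmpdir"

# Optional: record allocation sites (file/line) for objects allocated
# after this call, which makes the dump far more useful for attribution.
ObjectSpace.trace_object_allocations_start

dump_path = File.join(Dir.tmpdir, "heap.json")
File.open(dump_path, "w") do |f|
  # Writes one JSON document per live object: class, memory size,
  # references to other objects, and (if traced) the allocation site.
  ObjectSpace.dump_all(output: f)
end
```

For a live Puma worker you would normally inject something like this via a tool such as rbtrace rather than editing application code, and then post-process the dump (e.g. aggregate by class and allocation site) to find what is holding the memory.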