Some puma processes consuming lots of memory
Summary of findings
- Problem: What occasionally causes a Puma worker's memory usage to bloat so enormously that the kernel must OOM-kill it?
- Answer: An exception handler (Sentry) was trying to generate an 80 GB string.
- The exception handler was trying to compose a string showing the state of the objects it thought were relevant.
- In doing so, it recursively descended into the object hierarchy of the in-memory Prometheus metrics, specifically a set of thousands of `Prometheus::Client::MmapedValue` objects.
- Each of those `MmapedValue` objects points to a shared data structure (`MmapedDict`, which wraps `MmapedFile`) containing all stored metrics. So the string representation of each `MmapedValue` object included a copy of the entire 16 MB `MmapedFile` object, as represented by the `@positions` attribute of the singleton `MmapedDict` object.
- We don't know what exception was being thrown (because the string never finished being generated), but @stanhu has merged a fix to `MmapedDict` (https://gitlab.com/gitlab-org/prometheus-client-mmap/-/merge_requests/51/diffs) that makes its `inspect` method return just its object id rather than its complete contents. That should drastically reduce the size of the exception message, so the next time this happens, the message should be small enough that Sentry actually gets to log the exception rather than starve the process of memory.
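The failure mode above can be reproduced in miniature. This is a sketch with made-up classes (`FakeDict`, `FakeValue` stand in for `MmapedDict` and `MmapedValue`; this is not the actual prometheus-client-mmap code): Ruby's default `Object#inspect` embeds the `inspect` of every instance variable, so N wrappers pointing at one shared dict yield N copies of the dict's entire payload. Overriding `inspect` on the shared object, as the merged fix does, collapses that to a constant-size identity string.

```ruby
# Stand-in for MmapedDict: holds a large shared payload in @positions.
class FakeDict
  def initialize(size)
    @positions = "x" * size # stands in for the 16 MB mmapped file contents
  end
end

# Stand-in for MmapedValue: each instance references the shared dict.
class FakeValue
  def initialize(dict)
    @dict = dict
  end
end

dict   = FakeDict.new(1_000)
values = Array.new(100) { FakeValue.new(dict) }

# Default inspect recursively includes @dict, hence @positions, per value,
# so the combined output is at least 100 * 1000 bytes for a 1000-byte payload.
total = values.sum { |v| v.inspect.bytesize }
puts total

# The shape of the fix: make the shared object's inspect return only
# its identity instead of its contents.
class FakeDict
  def inspect
    "#<FakeDict:0x%016x>" % (object_id << 1)
  end
end

# Now each value's inspect is a few dozen bytes, regardless of payload size.
fixed_total = values.sum { |v| v.inspect.bytesize }
puts fixed_total
```

Scale the payload up to 16 MB and the wrapper count into the thousands, and the default behavior produces tens of gigabytes, which matches the 80 GB string observed here.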
Original details when we started this investigation
@andrewn posted this:
@sre-oncall I stumbled across this unusual memory usage pattern on web-18. https://dashboards.gitlab.net/d/web-main/web-overview?viewPanel=63&orgId=1
7 day view:
`ps` shows pid 14090 (comm `puma: cluster worker 5`) using lots of memory (and CPU):
```
git 10709 31.4 1.4 5083860 954136 ? Ssl Jul17 1249:20 | \_ puma 4.3.3.gitlab.2 (unix:///var/opt/gitlab/gitlab-rails/sockets/gitlab.socket,tcp://0.0.0.0:8080) [gi
git 14022 32.3 3.1 6589264 2047336 ? Sl 04:03 186:01 | \_ puma: cluster worker 0: 10709 [gitlab-puma-worker]
git 14024 32.2 3.7 6891540 2446168 ? Sl 04:03 185:28 | \_ puma: cluster worker 1: 10709 [gitlab-puma-worker]
git 14086 33.5 2.9 6368668 1974008 ? Sl 04:03 192:49 | \_ puma: cluster worker 3: 10709 [gitlab-puma-worker]
git 14088 32.6 2.8 6455624 1885604 ? Sl 04:03 187:46 | \_ puma: cluster worker 4: 10709 [gitlab-puma-worker]
git 14090 37.8 55.4 91223712 36604016 ? Sl 04:03 217:59 | \_ puma: cluster worker 5: 10709 [gitlab-puma-worker]
git 14092 33.9 2.7 6386488 1815192 ? Sl 04:03 195:27 | \_ puma: cluster worker 6: 10709 [gitlab-puma-worker]
git 14213 34.2 2.5 6338872 1702756 ? Sl 04:03 197:11 | \_ puma: cluster worker 8: 10709 [gitlab-puma-worker]
git 32623 41.2 2.0 6224160 1382136 ? Sl 07:44 146:17 | \_ puma: cluster worker 7: 10709 [gitlab-puma-worker]
git 1284 39.2 2.0 6054716 1352964 ? Sl 08:40 117:10 | \_ puma: cluster worker 14: 10709 [gitlab-puma-worker]
git 2268 41.9 2.1 6548104 1420500 ? Sl 08:43 124:08 | \_ puma: cluster worker 13: 10709 [gitlab-puma-worker]
git 3234 40.0 2.1 6123304 1405672 ? Sl 08:45 117:30 | \_ puma: cluster worker 15: 10709 [gitlab-puma-worker]
git 6609 39.7 2.0 6669448 1361928 ? Sl 08:54 113:07 | \_ puma: cluster worker 12: 10709 [gitlab-puma-worker]
git 12988 38.8 2.1 6183048 1445100 ? Sl 09:10 104:35 | \_ puma: cluster worker 9: 10709 [gitlab-puma-worker]
git 25519 39.9 2.1 6144648 1444748 ? Sl 09:27 100:43 | \_ puma: cluster worker 2: 10709 [gitlab-puma-worker]
git 18468 38.7 2.0 6036620 1379912 ? Sl 10:17 78:05 | \_ puma: cluster worker 11: 10709 [gitlab-puma-worker]
git 24371 49.1 1.8 5467272 1223292 ? Sl 13:14 12:01 | \_ puma: cluster worker 10: 10709 [gitlab-puma-worker]
```
`pstree`:
```
$ sudo pstree -p 14090
bundle(14090)─┬─{cluster.rb:248}(14091)
              ├─{cluster.rb:284}(14150)
              ├─{connection_poo*}(14148)
              ├─{daemon.rb:42}(14147)
              ├─{default-executo}(14186)
              ├─{emitters.rb:238}(24311)
              ├─{grpc_global_tim}(14188)
              ├─{grpc_global_tim}(14201)
              ├─{grpc_global_tim}(18417)
              ├─{puma server}(14154)
              ├─{resolver-execut}(14187)
              ├─{ruby_sampler}(14146)
              ├─{scheduler.rb:73}(11667)
              ├─{service_discov*}(14149)
              ├─{thread_pool.rb*}(14152)
              ├─{thread_pool.rb*}(14153)
              ├─{thread_pool.rb*}(14189)
              ├─{thread_pool.rb*}(14190)
              ├─{thread_pool.rb*}(7786)
              └─{thread_pool.rb*}(11665)
```
rbspy for that process doesn't show much, though lots of stacks appear to be doing GC-related work.
And here is a flamegraph from `perf` (`sudo perf record -p 14090 -F97 -g -- sleep 30`):
Since this appears to be happening only on some processes (and only some hosts?), it may be good to grab some sort of heap dump to see where the memory is being allocated.
Stackprof (gitlab-org/gitlab#225473 (closed)) may also help provide some insights; it has a memory mode as well. With that approach, though, we'd need to catch the process in the act.
One possible approach for heap dumping:
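A heap dump can be taken from inside a Ruby process with the stdlib `objspace` extension. This is only a sketch of one common technique, not necessarily the approach used in this investigation; the output path is illustrative.

```ruby
require "objspace"
require "tmpdir"

# Optional: record allocation sites (file/line) for objects allocated
# after this call, which makes the dump far more useful for attribution.
ObjectSpace.trace_object_allocations_start

dump_path = File.join(Dir.tmpdir, "heap.json")
File.open(dump_path, "w") do |f|
  # Writes one JSON document per live object: class, memory size,
  # references to other objects, and (if traced) the allocation site.
  ObjectSpace.dump_all(output: f)
end
```

For a live Puma worker you would normally inject something like this via a tool such as rbtrace rather than editing application code, and then post-process the dump (e.g. aggregate by class and allocation site) to find what is holding the memory.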