Prometheus Metrics for Ruby Garbage Collection are wrong

Code is at https://gitlab.com/gitlab-org/gitlab-ce/blob/master/lib/gitlab/metrics/samplers/ruby_sampler.rb#L67-68

Currently, the way we record Ruby GC stats is not working correctly.

This came out of a discussion with @msmiley about why our ruby_gc_duration_seconds_total metric (among others) does not have a worker dimension. Which worker's GC events are we actually seeing: one specific worker, or all of them?
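To make the "worker dimension" question concrete, here is a minimal sketch of per-worker GC accounting in plain Ruby. It is not the code in ruby_sampler.rb: `GcDurationStore`, `sample_gc`, and the `worker` label value are hypothetical names used only for illustration, and the in-memory hash stands in for a real Prometheus counter. It relies only on the standard `GC::Profiler` API, which reports GC time for the current process.

```ruby
# Hypothetical sketch, not the implementation in ruby_sampler.rb.
# An in-memory stand-in for a counter such as ruby_gc_duration_seconds_total,
# keyed by a label set so each worker gets its own series.
class GcDurationStore
  def initialize
    @totals = Hash.new(0.0)
  end

  # Add a GC duration (in seconds) to the series identified by `labels`.
  def increment(labels, seconds)
    @totals[labels] += seconds
  end

  def get(labels)
    @totals[labels]
  end

  def label_sets
    @totals.keys
  end
end

# Read the GC time accumulated in *this process* since the profiler was
# enabled or last cleared, reset the profiler, and add the value to the
# series for the given worker.
def sample_gc(store, worker_id)
  elapsed = GC::Profiler.total_time # seconds, as a Float
  GC::Profiler.clear
  store.increment({ worker: worker_id }, elapsed)
  elapsed
end

GC::Profiler.enable
store = GcDurationStore.new

gc_count_before = GC.stat(:count)
100_000.times { "x" * 100 } # allocate some garbage
GC.start                    # force at least one GC run
gc_count_after = GC.stat(:count)

# "worker_0" is an assumed label value; a real sampler would derive it
# from the worker process it runs in.
sample_gc(store, "worker_0")
```

Because `GC::Profiler` state is per process, each worker would have to run its own sampler and attach its own label. Without such a label, samples from different workers cannot be told apart once they are aggregated.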

According to our metrics, our processes record GC events only occasionally (i.e., every few hours). That is not plausible, which suggests that our Ruby GC Prometheus metrics are not working as expected.

https://prometheus-app.gprd.gitlab.net/graph?g0.range_input=12h&g0.expr=rate(ruby_gc_duration_seconds_total%7Benvironment%3D%22gprd%22%2Cfqdn%3D%22api-03-sv-gprd.c.gitlab-production.internal%22%2Cinstance%3D%22api-03-sv-gprd.c.gitlab-production.internal%3A8080%22%2Cjob%3D%22gitlab-unicorn%22%2Cstage%3D%22main%22%2Ctier%3D%22sv%22%2Ctype%3D%22api%22%7D%5B5m%5D)&g0.tab=0

Slack thread: https://gitlab.slack.com/archives/CB7P5CJS1/p1561111536271600?thread_ts=1561017355.155900&cid=CB7P5CJS1

cc @sengelhard @bjk-gitlab