Some RubySampler metrics have no pid label
While doing some analysis in Thanos, I noticed that some of our RubySampler metrics have no pid label attached. This is a problem because all ruby_ metrics are inherently tied to a single Ruby VM process. Without this label, every Puma or Sidekiq worker independently contributes its data, giving an inaccurate account of what is happening.
Only non-gauge metrics, such as counters and histograms, are affected. Gauges happen to be unaffected because we use the :all aggregation in prometheus-client-mmap by default, in which case the library injects a pid label for us. It does not do that for metrics that aren't gauges, however.
We should make sure that we distinguish per-process metrics by pid (meaning worker ID, not Linux process ID) and perform aggregations at the query level instead. Otherwise we cannot answer questions such as: "After puma_5 saw a spike in eden pages, how did that affect GC runtime?"
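
To illustrate why the label matters, here is a minimal Ruby sketch. It is a toy in-memory model, not the prometheus-client-mmap API: Sample, collapse, and sum_without are hypothetical helpers, and the worker values are made up. It assumes that samples sharing a metric name and label set end up as a single summed series.

```ruby
# Hypothetical model of exported samples; NOT the prometheus-client-mmap API.
Sample = Struct.new(:name, :labels, :value)

# Assumption: samples with an identical name and label set collapse into
# one series, with counter values summed.
def collapse(samples)
  samples
    .group_by { |s| [s.name, s.labels] }
    .map { |(name, labels), group| Sample.new(name, labels, group.sum(&:value)) }
end

# Query-level aggregation: drop one label and sum the series that become
# identical, roughly what `sum without (pid)` does in PromQL.
def sum_without(series, label)
  series
    .group_by { |s| [s.name, s.labels.reject { |k, _| k == label }] }
    .map { |(name, labels), group| Sample.new(name, labels, group.sum(&:value)) }
end

METRIC = "ruby_sampler_duration_seconds_total"
WORKERS = { "puma_0" => 1.5, "puma_5" => 4.0 } # made-up per-worker values

# Without a pid label, both workers land in the same series.
unlabeled = collapse(WORKERS.map { |_pid, v| Sample.new(METRIC, {}, v) })

# With a pid label (worker ID), each worker keeps its own series.
labeled = collapse(WORKERS.map { |pid, v| Sample.new(METRIC, { pid: pid }, v) })

# The fleet-wide total is still available by aggregating at query time.
total = sum_without(labeled, :pid)

puts "series without pid: #{unlabeled.size}"
puts "series with pid:    #{labeled.size}"
puts "puma_5 value:       #{labeled.find { |s| s.labels[:pid] == 'puma_5' }.value}"
puts "aggregated total:   #{total.first.value}"
```

In this model the labeled series can always be summed back into the total, but the unlabeled series cannot be split back into per-worker values, which is why aggregation belongs at the query level.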
Example: ruby_sampler_duration_seconds_total