Some RubySampler metrics have no pid label

While doing some analysis in Thanos, I noticed that some of our RubySampler metrics have no pid label attached. This is a problem because all ruby_ metrics are inherently tied to a single Ruby VM process. Without this label, every Puma or Sidekiq worker contributes its data to the same series, which gives an inaccurate account of what is happening in any one process.

Only non-gauge metrics, such as counters and histograms, are affected. Gauges happen not to be affected because by default we use the :all aggregation in prometheus-client-mmap, in which case the library injects a pid label for us. It does not do that for metrics that aren't gauges, however.

We should make sure that we distinguish process metrics by pid (meaning the worker ID, not the Linux process ID) and perform aggregations at the query level instead. Otherwise we cannot answer questions such as: "After puma_5 saw a spike in eden pages, how did that affect GC runtime?"
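To illustrate why the label matters, here is a minimal plain-Ruby sketch. It does not use the prometheus-client-mmap API. It only simulates how series with identical label sets merge. The worker IDs and sample values are made up for illustration, and merge_series and drop_label are hypothetical helpers.

```ruby
# Hypothetical samples: [metric name, labels, value].
# Worker IDs and values are invented for illustration only.
SAMPLES = [
  ["ruby_sampler_duration_seconds_total", { pid: "puma_0" }, 1.5],
  ["ruby_sampler_duration_seconds_total", { pid: "puma_5" }, 4.0],
  ["ruby_sampler_duration_seconds_total", { pid: "sidekiq_0" }, 2.5]
].freeze

# Sum all samples that share the same name and label set. This mimics
# counters from several processes collapsing into one series when no
# label tells them apart.
def merge_series(samples)
  samples.each_with_object(Hash.new(0.0)) do |(name, labels, value), series|
    series[[name, labels]] += value
  end
end

# Return the samples with one label removed from every label set.
def drop_label(samples, label)
  samples.map do |name, labels, value|
    [name, labels.reject { |key, _| key == label }, value]
  end
end

# With the pid label, each worker keeps its own series.
WITH_PID = merge_series(SAMPLES)

# Without it, all workers are folded into a single series.
WITHOUT_PID = merge_series(drop_label(SAMPLES, :pid))

WITH_PID.each { |(name, labels), value| puts "#{name}#{labels} #{value}" }
WITHOUT_PID.each { |(name, labels), value| puts "#{name}#{labels} #{value}" }
```

With the label in place, the per-worker series can still be summed at query time to get the same total. Once the label is gone, the per-worker values can no longer be recovered from the merged series.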

Example: ruby_sampler_duration_seconds_total

Edited Jul 12, 2022 by Matthias Käppler