Skip to content

Invalid the Puma metrics on dev and pre

Summary

We deployed Puma on dev.gitlab.org. While reviewing metrics I noticed that all puma_killer_terminations_total is set to 0, and also incomplete:

# HELP puma_killer_terminations_total Multiprocess metric
# TYPE puma_killer_terminations_total counter
puma_killer_terminations_total{worker="worker_10"} 0
puma_killer_terminations_total{worker="worker_13"} 0
puma_killer_terminations_total{worker="worker_14"} 0
puma_killer_terminations_total{worker="worker_4"} 0
puma_killer_terminations_total{worker="worker_6"} 0
puma_killer_terminations_total{worker="worker_7"} 0
puma_killer_terminations_total{worker="worker_9"} 0

This seems to contradicts with (does it?):

# HELP ruby_process_start_time_seconds Multiprocess metric
# TYPE ruby_process_start_time_seconds gauge
ruby_process_start_time_seconds{pid="puma_0"} 1564384955
ruby_process_start_time_seconds{pid="puma_1"} 1564384955
ruby_process_start_time_seconds{pid="puma_10"} 1564406017
ruby_process_start_time_seconds{pid="puma_11"} 1564384955
ruby_process_start_time_seconds{pid="puma_12"} 1564384955
ruby_process_start_time_seconds{pid="puma_13"} 1564406217
ruby_process_start_time_seconds{pid="puma_14"} 1564402736
ruby_process_start_time_seconds{pid="puma_15"} 1564384955
ruby_process_start_time_seconds{pid="puma_2"} 1564384955
ruby_process_start_time_seconds{pid="puma_3"} 1564384955
ruby_process_start_time_seconds{pid="puma_4"} 1564405757
ruby_process_start_time_seconds{pid="puma_5"} 1564384955
ruby_process_start_time_seconds{pid="puma_6"} 1564408497
ruby_process_start_time_seconds{pid="puma_7"} 1564394256
ruby_process_start_time_seconds{pid="puma_8"} 1564384955
ruby_process_start_time_seconds{pid="puma_9"} 1564405857

Does it mean that none of Puma workers were restarted?

Secondly, the puma_killer_terminations_total does not make sense in context of worker, but only in context of master.

I also noticed the puma_phase on other puma metrics to be duplicated per-worker and per-pid. We should only see pid=puma_master with livesum, and multiple workers, as we gather metrics on Puma master thread.

puma_phase{worker="worker_12",pid="puma_10"} 0
puma_phase{worker="worker_12",pid="puma_13"} 0
puma_phase{worker="worker_12",pid="puma_14"} 0
puma_phase{worker="worker_12",pid="puma_4"} 0
puma_phase{worker="worker_12",pid="puma_6"} 0
puma_phase{worker="worker_12",pid="puma_7"} 0
puma_phase{worker="worker_12",pid="puma_9"} 0
puma_phase{worker="worker_12",pid="puma_master"} 0

What is the current bug behavior?

  1. Is puma_killer_terminations_total? Did we not restart any of Puma workers,
  2. We have invalid duplicate puma_phase with pid!=puma_master and duplicate number of workers, does it mean that somehow we leak PumaSampler into forks?
  3. Is ruby_process_start_time_seconds accurate? Does it mean that we did not restart any of Puma workers?
  4. When running the changes(ruby_process_start_time_seconds[1m]{fqdn=~".*puma.*"}) seems to not correspond to the values of puma_killer_terminations_total? Do we even need the puma_killer_terminations_total if we have changes()?

Summary (13-Aug-2019):

  1. puma_killer_terminations_total were not working, as we wiped the counter for master process entirely.
    Fix MR (https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/31668) is merged and the metric is non-zero on dev: https://gitlab.com/gitlab-org/gitlab-ce/issues/65278#note_203235627
  2. :livesum for puma_phases MR: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/31773 (need to check it locally)
  3. ruby_process_start_time_seconds seems correct to me
Edited by Aleksei Lipniagov