Invalid the Puma metrics on dev and pre
Summary
We deployed Puma on dev.gitlab.org
. While reviewing metrics I noticed that all
puma_killer_terminations_total
is set to 0
, and also incomplete:
# HELP puma_killer_terminations_total Multiprocess metric
# TYPE puma_killer_terminations_total counter
puma_killer_terminations_total{worker="worker_10"} 0
puma_killer_terminations_total{worker="worker_13"} 0
puma_killer_terminations_total{worker="worker_14"} 0
puma_killer_terminations_total{worker="worker_4"} 0
puma_killer_terminations_total{worker="worker_6"} 0
puma_killer_terminations_total{worker="worker_7"} 0
puma_killer_terminations_total{worker="worker_9"} 0
This seems to contradicts with (does it?):
# HELP ruby_process_start_time_seconds Multiprocess metric
# TYPE ruby_process_start_time_seconds gauge
ruby_process_start_time_seconds{pid="puma_0"} 1564384955
ruby_process_start_time_seconds{pid="puma_1"} 1564384955
ruby_process_start_time_seconds{pid="puma_10"} 1564406017
ruby_process_start_time_seconds{pid="puma_11"} 1564384955
ruby_process_start_time_seconds{pid="puma_12"} 1564384955
ruby_process_start_time_seconds{pid="puma_13"} 1564406217
ruby_process_start_time_seconds{pid="puma_14"} 1564402736
ruby_process_start_time_seconds{pid="puma_15"} 1564384955
ruby_process_start_time_seconds{pid="puma_2"} 1564384955
ruby_process_start_time_seconds{pid="puma_3"} 1564384955
ruby_process_start_time_seconds{pid="puma_4"} 1564405757
ruby_process_start_time_seconds{pid="puma_5"} 1564384955
ruby_process_start_time_seconds{pid="puma_6"} 1564408497
ruby_process_start_time_seconds{pid="puma_7"} 1564394256
ruby_process_start_time_seconds{pid="puma_8"} 1564384955
ruby_process_start_time_seconds{pid="puma_9"} 1564405857
Does it mean that none of Puma workers were restarted?
Secondly, the puma_killer_terminations_total
does not make sense in context
of worker, but only in context of master
.
I also noticed the puma_phase
on other puma metrics to be duplicated per-worker
and per-pid. We should only see pid=puma_master
with livesum, and multiple
workers
, as we gather metrics on Puma master thread.
puma_phase{worker="worker_12",pid="puma_10"} 0
puma_phase{worker="worker_12",pid="puma_13"} 0
puma_phase{worker="worker_12",pid="puma_14"} 0
puma_phase{worker="worker_12",pid="puma_4"} 0
puma_phase{worker="worker_12",pid="puma_6"} 0
puma_phase{worker="worker_12",pid="puma_7"} 0
puma_phase{worker="worker_12",pid="puma_9"} 0
puma_phase{worker="worker_12",pid="puma_master"} 0
What is the current bug behavior?
- Is
puma_killer_terminations_total
? Did we not restart any of Puma workers, - We have invalid duplicate
puma_phase
withpid!=puma_master
and duplicate number of workers, does it mean that somehow we leakPumaSampler
into forks? - Is
ruby_process_start_time_seconds
accurate? Does it mean that we did not restart any of Puma workers? - When running the
changes(ruby_process_start_time_seconds[1m]{fqdn=~".*puma.*"})
seems to not correspond to the values ofpuma_killer_terminations_total
? Do we even need thepuma_killer_terminations_total
if we havechanges()
?
Summary (13-Aug-2019):
-
puma_killer_terminations_total
were not working, as we wiped the counter for master process entirely.
Fix MR (https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/31668) is merged and the metric is non-zero ondev
: https://gitlab.com/gitlab-org/gitlab-ce/issues/65278#note_203235627 - :livesum for
puma_phases
MR: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/31773 (need to check it locally) -
ruby_process_start_time_seconds
seems correct to me
Edited by Aleksei Lipniagov