Improve alerts on NFS servers
During testing of Gitaly cgroups (#2511), we found a few ways our monitoring can be improved (https://gitlab.com/gitlab-com/infrastructure/issues/2511#note_39614774). This issue tracks those steps:
- Add a "number of processes" graph, including running, sleeping, and zombie processes.
- Define a sane alert threshold for the number of running and zombie processes.
- Set memory alerts for when Gitaly uses 30G of RAM (the hard limit with cgroups is now 32G).
- Disable CPU alerts (or make them less noisy). Cgroup limits now take care of this: the system will always stay responsive, so Gitaly can use all the compute power it wants.
- (possibly) Alert on OOM invocations.
- (possibly) Do we need to export cgroups_* metrics?
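
To make the process-count and memory items concrete, here is a rough Python sketch of the underlying checks. It is only an illustration, not how our exporters or alert rules are implemented: the function names (`parse_state`, `count_states`, `read_proc_stat_lines`, `memory_alert`) are hypothetical, the process states are read straight from `/proc/<pid>/stat`, and the 30G/32G figures are taken from the list above and treated as GiB, which is an assumption.

```python
from collections import Counter
from pathlib import Path

GIB = 1024 ** 3
MEMORY_ALERT_BYTES = 30 * GIB  # proposed alert threshold (assumed to mean GiB)
MEMORY_LIMIT_BYTES = 32 * GIB  # current cgroup hard limit (assumed to mean GiB)

# Only the three states the proposed graph mentions are counted.
# Other states (for example D, uninterruptible sleep) are ignored here.
STATE_NAMES = {"R": "running", "S": "sleeping", "Z": "zombie"}


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' before reading the state.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_states(stat_lines) -> Counter:
    """Count running, sleeping, and zombie processes."""
    counts = Counter({name: 0 for name in STATE_NAMES.values()})
    for line in stat_lines:
        name = STATE_NAMES.get(parse_state(line))
        if name is not None:
            counts[name] += 1
    return counts


def read_proc_stat_lines(proc_root: str = "/proc"):
    """Yield the contents of every /proc/<pid>/stat file that can be read."""
    for entry in Path(proc_root).iterdir():
        if entry.name.isdigit():
            try:
                yield (entry / "stat").read_text()
            except OSError:
                continue  # the process exited while we were scanning


def memory_alert(usage_bytes: int, threshold_bytes: int = MEMORY_ALERT_BYTES) -> bool:
    """Return True once usage reaches the alert threshold."""
    return usage_bytes >= threshold_bytes


if __name__ == "__main__":
    print(dict(count_states(read_proc_stat_lines())))
```

Alerting at 30G leaves about 2G of headroom before the cgroup hard limit, so the alert should fire before the OOM killer is invoked. The actual thresholds for running and zombie process counts still need to be defined, as noted in the list.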
@gl-infra anything else I have missed? /cc @bjk-gitlab @andrewn