Use cAdvisor to monitor cgroups on the NFS servers
Gitaly is contained within a cgroup on the File Servers.
https://gitlab.com/gitlab-com/infrastructure/issues/2734 is about improving the metrics around cgroups. One of the suggestions is to use cAdvisor
to monitor the cgroups on the file servers.
cAdvisor offers a Prometheus scrape endpoint, so it works well with our existing monitoring infrastructure.
Additionally, resources on the file servers are mainly consumed by three components: Gitaly, git processes, and NFSd. Since we already have host metrics and Gitaly metrics, adding cgroup metrics would let us estimate the resources being consumed by NFSd and by the git processes.
For example:
NFS CPU Consumption = Total Used Host CPU - Gitaly Cgroup CPU
and
git process memory = Gitaly Cgroup memory - Gitaly process memory
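As a rough illustration of the two subtractions above, here is a minimal Python sketch. The `Sample` type, its field names, and the units (CPU cores, bytes) are hypothetical and chosen only for this example; in practice the values would come from host, cgroup, and Gitaly metrics. The estimates also assume that nothing else significant runs on the host outside the Gitaly cgroup.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One point-in-time reading. All field names are illustrative."""
    host_cpu_used: float        # CPU cores in use across the whole host
    gitaly_cgroup_cpu: float    # CPU cores used by everything in the Gitaly cgroup
    gitaly_cgroup_memory: int   # bytes used by everything in the Gitaly cgroup
    gitaly_process_memory: int  # bytes used by the Gitaly process itself


def estimate_nfs_cpu(sample: Sample) -> float:
    """NFS CPU = total used host CPU - Gitaly cgroup CPU, clamped at zero."""
    return max(sample.host_cpu_used - sample.gitaly_cgroup_cpu, 0.0)


def estimate_git_memory(sample: Sample) -> int:
    """git memory = Gitaly cgroup memory - Gitaly process memory, clamped at zero."""
    return max(sample.gitaly_cgroup_memory - sample.gitaly_process_memory, 0)


if __name__ == "__main__":
    gib = 1024 ** 3
    reading = Sample(
        host_cpu_used=12.5,
        gitaly_cgroup_cpu=9.0,
        gitaly_cgroup_memory=8 * gib,
        gitaly_process_memory=3 * gib,
    )
    print("estimated NFS CPU (cores):", estimate_nfs_cpu(reading))
    print("estimated git memory (GiB):", estimate_git_memory(reading) / gib)
```

The clamping at zero is a design choice for this sketch: the inputs are sampled at slightly different times, so a small negative difference is more likely noise than a real value.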
Having these metrics, and appropriate dashboards, could possibly help in diagnosing some of the issues we're seeing on the file servers.
Additionally, at present, we have very little insight into how frequently we're hitting the limits we've set on our cgroups. Adding cAdvisor monitoring would improve this.
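One way to quantify how often a CPU limit is hit is the fraction of CFS scheduling periods in which the cgroup was throttled. The sketch below assumes two readings of the cAdvisor counters `container_cpu_cfs_periods_total` and `container_cpu_cfs_throttled_periods_total` for the Gitaly cgroup; it also assumes the cgroup has a CFS quota configured, since otherwise these counters are not expected to be populated. The function name and arguments are hypothetical.

```python
def throttled_ratio(
    periods_start: float,
    periods_end: float,
    throttled_start: float,
    throttled_end: float,
) -> float:
    """Fraction of CFS periods in the window during which the cgroup was throttled.

    Arguments are two readings of each counter, taken at the start and end
    of the same window. Returns 0.0 if no periods elapsed (or a counter reset
    made the delta non-positive).
    """
    elapsed_periods = periods_end - periods_start
    if elapsed_periods <= 0:
        return 0.0
    return (throttled_end - throttled_start) / elapsed_periods


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the file servers.
    ratio = throttled_ratio(
        periods_start=1000, periods_end=2000,
        throttled_start=50, throttled_end=300,
    )
    print(f"throttled in {ratio:.0%} of CFS periods")
```

A similar delta over `container_memory_failcnt` would give a count of how many times the memory limit was hit in the same window, assuming that counter is exported for the cgroup.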