#3307
Issue created Nov 27, 2017 by Andrew Newdigate (@andrewn, Owner)

Use cAdvisor to monitor cgroups on the NFS servers

Gitaly is contained within a cgroup on the File Servers.

https://gitlab.com/gitlab-com/infrastructure/issues/2734 is about improving the metrics around cgroups. One of the suggestions is to use cAdvisor to monitor the cgroups on the file servers.

cAdvisor offers a Prometheus scrape endpoint, so it works well with our existing monitoring infrastructure.

Additionally, resources on the file servers are mainly consumed by three components: Gitaly, git processes, and NFSd. Since we already have host metrics and Gitaly metrics, adding cgroup metrics would let us estimate the resources consumed by NFSd and by the git processes.

For example:

NFS CPU Consumption = Total Used Host CPU - Gitaly Cgroup CPU

and

git process memory = Gitaly Cgroup memory - Gitaly process memory
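
As a rough illustration, the two subtractions above could be sketched in Python as follows. The function and parameter names are hypothetical, and the sketch assumes the git processes spawned by Gitaly run inside the Gitaly cgroup and that both inputs to each function are sampled over the same window and in the same units. Results are clamped at zero because samples taken at slightly different times can otherwise produce small negative values.

```python
def estimate_nfs_cpu(total_used_host_cpu: float, gitaly_cgroup_cpu: float) -> float:
    """Estimate CPU used outside the Gitaly cgroup (attributed here to NFSd).

    Both arguments must use the same unit, e.g. CPU cores averaged over
    the same time window. This is an estimate: anything else running on
    the host outside the cgroup is counted as NFS as well.
    """
    return max(total_used_host_cpu - gitaly_cgroup_cpu, 0.0)


def estimate_git_process_memory(gitaly_cgroup_memory: int, gitaly_process_memory: int) -> int:
    """Estimate memory used by git processes inside the Gitaly cgroup.

    Both arguments are in bytes. Assumes the cgroup contains only the
    Gitaly process itself plus the git processes it spawns.
    """
    return max(gitaly_cgroup_memory - gitaly_process_memory, 0)


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Illustrative numbers only, not measurements from the file servers.
    print(estimate_nfs_cpu(6.5, 4.0))
    print(estimate_git_process_memory(8 * GIB, 3 * GIB) // GIB)
```

In practice these would more likely be written as Prometheus recording rules or dashboard queries; the Python version is only meant to make the arithmetic and its assumptions explicit.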

Having these metrics, and appropriate dashboards, could possibly help in diagnosing some of the issues we're seeing on the file servers.

Additionally, at present, we have very little insight into how frequently we're hitting the limits we've set on our cgroups. Adding cAdvisor monitoring would improve this.
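
A minimal sketch of how limit hits might be read from a scraped metrics page is shown below. It assumes cAdvisor exposes a memory fail counter named `container_memory_failcnt` in the Prometheus text format; that name and the `/gitaly` cgroup id in the usage example are assumptions and should be checked against the actual endpoint. The parser is deliberately simple and ignores edge cases such as label values that contain a closing brace.

```python
import re

# One sample line in the Prometheus text format: name, optional {labels}, value.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str, metric: str) -> list:
    """Return (labels, value) pairs for every sample of `metric` in `text`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((labels, float(match.group("value"))))
    return samples


def limit_hits(before: float, after: float) -> float:
    """Limit hits between two scrapes of a counter (clamped on counter reset)."""
    return max(after - before, 0.0)


if __name__ == "__main__":
    # Illustrative scrape output; the cgroup id "/gitaly" is hypothetical.
    scrape = (
        "# TYPE container_memory_failcnt counter\n"
        'container_memory_failcnt{id="/gitaly"} 12\n'
        'container_memory_usage_bytes{id="/gitaly"} 1024\n'
    )
    print(parse_samples(scrape, "container_memory_failcnt"))
```

Comparing the counter across two scrapes with `limit_hits` gives the number of times the limit was hit in that interval, which is the figure a dashboard or alert would track.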
