Prometheus Service Discovery for GitLab HA instances
Problem to solve
Today, it can be a challenge to set up Prometheus monitoring across a multi-node omnibus based GitLab installation. (We already have automatic service discovery in Kubernetes)
What can make this more challenging, is that your needs may be different depending on how you choose to deploy GitLab HA. For example, if you deploy GitLab in AWS you may be using features like Auto-scaling Groups which can elastically scale and restart GitLab nodes. This will impact the discovery solution.
Target audience
- Sidney, Systems Administrator, https://design.gitlab.com/research/personas#persona-sidney
Further details
Proposal
As an MVC - we should enhance our documentation to include additional discovery options like EC2 and Azure, which can detect the instances of GitLab and automatically scrape them. This would support dynamic workflows like auto-scaling groups, where underlying instances and addresses will change. (For example, here is EC2 discovery)
As a further iteration, we should consider a programmatic solution like DNS or Consul based SD. Consul may be the most broadly applicable, and we include it with GitLab EE packages already. We could have each node register itself with Consul along with what exporters it is running, and Prometheus can then go scrape it.
DNS could still also work seamlessly if an environment has been setup correctly, and wouldn't require the complexity of Consul, but not all environments may support this.
What does success look like, and how can we measure that?
Reduced manual configuration to enable Prometheus scraping of multiple GitLab nodes.