Dogfood Metrics for a single critical GitLab.com metric, including creating alerts

Problem to solve

We aren't dogfooding Metrics for meaningful infra workflows. We should, so that we can iterate more rapidly on improving the feature's capabilities.

Intended users

GitLab.com Infra Team

Further details

Proposal

This is an off-the-cuff proposal; we should focus on the art of the possible here.

Option 1

Use an external Prometheus server

  • Create a project in ops.gitlab.net
  • Attach an external Prometheus server
  • Add custom metrics (SLO-related ones preferable)
  • Add custom alerts which trigger incident issues
  • Utilize the default issue template
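
As a rough illustration of the kind of SLO-related metric and alert condition Option 1 refers to, here is a minimal Python sketch that evaluates an error-ratio query through the Prometheus HTTP API (`/api/v1/query`). It does not configure anything in GitLab; it only shows the condition a custom alert would encode. The server URL, the `http_requests_total` metric, and the 0.1% threshold are all hypothetical placeholders, not values from this issue.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# All three values below are hypothetical placeholders.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"
ERROR_RATIO_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)
SLO_ERROR_THRESHOLD = 0.001  # alert when more than 0.1% of requests fail


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def first_sample_value(payload: dict) -> Optional[float]:
    """Return the first sample of an instant-vector response, or None if empty."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def should_alert(value: Optional[float], threshold: float) -> bool:
    """Fire only when a value exists and exceeds the threshold."""
    return value is not None and value > threshold


def check_slo(base_url: str = PROMETHEUS_URL) -> bool:
    """Query Prometheus and report whether the alert condition currently holds."""
    url = build_query_url(base_url, ERROR_RATIO_QUERY)
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return should_alert(first_sample_value(payload), SLO_ERROR_THRESHOLD)
```

In the proposal itself, the same query and threshold would be entered as a custom metric and alert on the project, so that a firing alert opens an incident issue.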

Option 2

Use a GitLab-managed Prometheus server

  • Create a project in ops.gitlab.net
  • Add a Kubernetes cluster and install a managed Prometheus server
  • Send GitLab.com metrics to both our production Prometheus server and this managed one
  • Add the custom metric and alert

Permissions and Security

Documentation

Testing

What does success look like, and how can we measure that?

Success means the infra teams are actively using the GitLab monitoring features and creating issues to improve them.

Links / references