Dogfood Metrics for a single critical GitLab.com metric, including creating alerts

Problem to solve

We aren't dogfooding Metrics for meaningful infra workflows. We should, so that we can iterate more rapidly on improving the feature's capabilities.

Intended users

GitLab.com Infra Team

Further details

Proposal

This is my off-the-cuff proposal; we should focus on the art of the possible here.

Option 1

Use an external Prometheus server

Create a project in ops.gitlab.net
Attach an external Prometheus server
Add custom metrics (SLO-related ones preferable)
Add custom alerts which trigger incident issues
Utilize the default issue template
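
To make "SLO-related metric plus alert" concrete, here is a minimal sketch of the condition such an alert might encode. It is illustrative only: the `SloWindow` type, the function names, and the 99.9% target are assumptions, not an existing GitLab or Prometheus API, and the real rule would be written as a query against the attached Prometheus server.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_ratio(window: SloWindow) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if window.total_requests == 0:
        return 0.0
    return window.failed_requests / window.total_requests


def should_alert(window: SloWindow, slo_target: float = 0.999) -> bool:
    """Fire when the error ratio exceeds the budget implied by the SLO target.

    The 0.999 default is an assumed target, not a documented GitLab.com SLO.
    """
    error_budget = 1.0 - slo_target
    return error_ratio(window) > error_budget


if __name__ == "__main__":
    # 5 failures in 1000 requests exceeds a 0.1% error budget.
    print(should_alert(SloWindow(total_requests=1000, failed_requests=5)))
```

An alert firing on this condition is what would open the incident issue from the default template.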

Option 2

Use a GitLab-managed Prometheus server

Create a project in ops.gitlab.net
Add a Kubernetes cluster and install a managed Prometheus server
Point GitLab.com metrics to both our production Prometheus server and this managed one
Add the custom metric and alert

Permissions and Security

Documentation

Testing

What does success look like, and how can we measure that?

We succeed if the infra teams are actively using the GitLab monitor features and creating issues to improve them.