Dogfood Metrics for a Single Critical GitLab.com Metric, Including Creating Alerts
Problem to solve
We aren't dogfooding metrics for meaningful infra workflows. We should, so that we can iterate more rapidly on improving the feature's capabilities.
Intended users
GitLab.com Infra Team
Further details
Proposal
This is an off-the-cuff proposal; we should focus on the art of the possible here.
Option 1
Use an external Prometheus server
- Create a project in ops.gitlab.net
- Attach an external Prometheus server
- Add custom metrics (SLO-related ones preferable)
- Add custom alerts which trigger incident issues
- Utilize the default issue template
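
As a rough illustration of what an SLO-related custom metric and alert condition in Option 1 might look like, here is a minimal Python sketch against the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The server URL, the `http_requests_total` metric, the PromQL expression, and the 0.1% threshold are all assumptions made for illustration; none of them come from the GitLab.com setup. It does not show how GitLab itself evaluates alerts, only the kind of check the alert would encode.

```python
import json
from urllib.parse import urlencode

# Hypothetical values: the server URL, metric name, and threshold are illustrative only.
PROMETHEUS_URL = "http://prometheus.example.com:9090"
ERROR_RATIO_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)
SLO_ERROR_RATIO = 0.001  # assumed SLO: alert when more than 0.1% of requests fail


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def should_alert(response_body: str, threshold: float) -> bool:
    """Return True if any sample in an instant-query response exceeds the threshold."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    results = payload["data"]["result"]
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "value as string"]}.
    return any(float(sample["value"][1]) > threshold for sample in results)


# Canned response in the instant-query format, so the sketch runs without a server.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000, "0.004"]}],
    },
})

print(build_query_url(PROMETHEUS_URL, ERROR_RATIO_QUERY))
print(should_alert(sample_response, SLO_ERROR_RATIO))
```

In practice the query and threshold would be entered as the project's custom metric and alert rather than run from a script; the sketch only makes the condition concrete enough to discuss.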
Option 2
Use a GitLab-managed Prometheus server
- Create a project in ops.gitlab.net
- Add a Kubernetes cluster and install a managed Prometheus server
- Send GitLab.com metrics to both our production Prometheus server and this managed one
- Add the custom metric and alert
Permissions and Security
Documentation
Testing
What does success look like, and how can we measure that?
Success means the infra teams are actively using the GitLab monitor features and creating issues to improve them.