GitLab.com Infra Dogfooding of Metrics and Incidents
Exploration issue for what it would take for GitLab.com's Infra team to start dogfooding metrics. Our current capabilities are documented [here](https://docs.gitlab.com/ee/user/project/integrations/prometheus.html).

## Intent

The intent is not to use GitLab.com as a representative sample of our customers' size and complexity, but to [put the system through its paces and generate rapid feedback on direction](https://about.gitlab.com/handbook/product/#dogfood-everything).

## Tasks

* [x] Identify stakeholders
* [x] Ensure common understanding of intent
* [x] Understand current pain points and the rationale for not using the feature
* [x] Discuss possible initial steps (a project with a single metric on a GitLab dashboard)
* [x] Populate this epic with actions

## Actions

**design.gitlab.com**

* [x] Investigate [design.gitlab.com's project](https://gitlab.com/gitlab-org/gitlab-services/design.gitlab.com/environments/269942/metrics) to see if [dashboards](https://gitlab.com/gitlab-org/gitlab-services/design.gitlab.com/environments/269942/metrics) and alarms are set up - @kencjohnston
* [ ] If not, set up alarms to create incidents

**versions.gitlab.com**

* [ ] Move to ADO
* [ ] Configure metrics and alerts
* [ ] Create incidents from alerts

**license.gitlab.com**

* [ ] Move to ADO
* [ ] Configure metrics and alerts
* [ ] Create incidents from alerts

**Other**

* [ ] Educate SRE/Infra team on Ops capabilities and vision - @kencjohnston
* [ ] Create issue for better display of alarm threshold time windows (like Datadog's) - @kencjohnston
* [ ] Create issue: allow a source-controlled Alertmanager config to be easily updated on the Prometheus cluster during deploy - @kencjohnston

Key questions in threads below.
epic