Revive GitLab Deployment Health Indicator metric
Context
Some time ago, we created silenced Prometheus alerts for deployment go/no-go decisions. They encode, as a binary signal, our perception of whether the production environment is in a state that allows deploying. This effort formalized the gitlab_deployment_health metric.
Problem
This metric was developed outside of the team and is currently used in our Release tooling. For example, when a canary deployment induces an error, the error is reflected in the apdex and error rates, and the production deployment is not allowed. The metric is currently somewhat orphaned, and we should treat it as a first-class citizen.
Goal
- Document the metric and how it is used by release-tools/deployer.
- Verify that alerts based on metrics are currently working.
- Use the metric in a dashboard (e.g., the Release Management dashboard) to understand the current trend (e.g., the percentage of time healthy versus unhealthy, i.e. the duty cycle, within a timeframe such as the last 24 hours or since the last deploy).
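As a rough illustration of the duty-cycle figure mentioned in the last goal, the sketch below computes the time-weighted share of a window spent healthy. It assumes the metric is available as a series of (timestamp, value) samples with 1 meaning healthy and 0 meaning unhealthy, and that each value holds until the next sample. The sample format and the name `healthy_fraction` are hypothetical and not taken from the release tooling.

```python
from typing import Sequence, Tuple

# Hypothetical sample format: (unix timestamp in seconds, 1 = healthy, 0 = unhealthy).
Sample = Tuple[float, int]


def healthy_fraction(
    samples: Sequence[Sample], window_start: float, window_end: float
) -> float:
    """Return the time-weighted fraction of the window spent healthy.

    Each sample's value is assumed to hold until the next sample.
    Time inside the window that precedes the first sample is ignored.
    """
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")

    ordered = sorted(samples)
    healthy = 0.0
    observed = 0.0
    for i, (ts, value) in enumerate(ordered):
        # The last sample is assumed to hold until the end of the window.
        next_ts = ordered[i + 1][0] if i + 1 < len(ordered) else window_end
        # Clip the interval covered by this sample to the window.
        start = max(ts, window_start)
        end = min(next_ts, window_end)
        if end <= start:
            continue
        observed += end - start
        if value == 1:
            healthy += end - start

    if observed == 0:
        raise ValueError("no samples cover the requested window")
    return healthy / observed
```

For a "last 24 hours" panel, `window_end` would be the current time and `window_start` 24 hours earlier; for "since last deploy", `window_start` would be the timestamp of the last deploy. A dashboard would more likely compute this directly in the query layer, so this is only meant to pin down the definition.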
Edited by Michele Bursi