Determine how we can add metrics for deployment health evaluation
Currently, our gitlab_deployment_health metric relies on the existing Apdex and error SLOs of a given service to determine whether that service is considered healthy. Nothing currently allows us to easily add another metric, remove one we deem unusable, or modify how health is evaluated.
We should consider that we may want to add other metrics to evaluate deployment health. For example, if we see a severe spike in CPU usage after a canary deploy while the Apdex and error SLOs are still within bounds, we risk deploying something that will either exhaust capacity or degrade service if the main stage cannot cope with the load.
Another example: GitLab Agent for Kubernetes currently has no Apdex, so right now its health metric takes nothing other than errors into account. That team may decide to add a new metric that is better suited but is not an Apdex-style metric. Any service team may want to do the same.
Another example: a team adds a feature that they want to keep a keen eye on. They may have created a dedicated set of metrics for that feature, which could play an integral role in whether or not the service is healthy. Teams should be able to quickly add and remove metrics for health evaluation, and adjust the weights that govern how much a given metric contributes to whether or not a service is considered healthy.
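To make the idea of adding, removing, and weighting metrics concrete, here is a minimal sketch in Python. It is not how gitlab_deployment_health or the metrics-catalog works today; the class names, the metric names, the weights, and the 0.75 threshold are all hypothetical, and each metric's check is a stand-in callable rather than a real query.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class HealthMetric:
    """One signal that contributes to deployment health (hypothetical)."""
    name: str
    weight: float                    # relative influence on the final score
    is_healthy: Callable[[], bool]   # stand-in for evaluating a real query


@dataclass
class DeploymentHealth:
    """Per-service registry of metrics, combined by weighted vote."""
    threshold: float = 0.75          # assumed cut-off, not an existing setting
    metrics: Dict[str, HealthMetric] = field(default_factory=dict)

    def add(self, metric: HealthMetric) -> None:
        if metric.weight <= 0:
            raise ValueError("weight must be positive")
        self.metrics[metric.name] = metric   # re-adding a name replaces it

    def remove(self, name: str) -> None:
        self.metrics.pop(name, None)

    def score(self) -> float:
        """Fraction of total weight held by metrics that report healthy."""
        total = sum(m.weight for m in self.metrics.values())
        if total == 0:
            raise ValueError("no metrics registered")
        healthy = sum(m.weight for m in self.metrics.values() if m.is_healthy())
        return healthy / total

    def healthy(self) -> bool:
        return self.score() >= self.threshold


if __name__ == "__main__":
    health = DeploymentHealth()
    health.add(HealthMetric("apdex_slo", 2.0, lambda: True))
    health.add(HealthMetric("error_slo", 2.0, lambda: True))
    # Hypothetical extra signal: CPU spiked after the canary deploy.
    health.add(HealthMetric("cpu_saturation", 1.0, lambda: False))
    print(health.score(), health.healthy())
```

With these assumed weights, a failing CPU signal alone leaves the service above the threshold; raising its weight to 2.0 would push the score below 0.75 and mark the deployment unhealthy. A team could tune that trade-off per service without touching the Apdex or error SLO definitions.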
Milestones
- Consider segregating the current Deployment Health metric from our current Apdex and error SLOs
- Create a mechanism in our metrics-catalog that enables anyone to target specific queries for evaluation of deployment health
- ...