Capture and display types of deployment blockers on Grafana
Context
We currently have deployment failures automatically captured under release tasks issues. Release managers are assigned to label them with the appropriate RootCause::* labels, so that at the start of the following week (Monday), the deployments:blockers_report scheduled pipeline in the release/tools repo looks through the labelled issues and creates a weekly deployment blockers issue, like this one.
We want to capture the trend of recurring root causes of deployment blockers, so that we can display how their frequency changes as we modify our processes. Recently, RootCause::Flakey-Tests has been very frequent, and we want to keep track of the trend while we work alongside grouptesting to decrease this count.
This issue is to emit a metric to Grafana that captures the types of root causes, and to use that metric to create a dashboard panel with trends (weekly or monthly). Let's start with RootCause::Flakey-Tests for now.
Implementation Ideas
This is just a draft of how to implement this metric, to be iterated upon. Essentially, we want to leverage a lot of what's already been implemented in the release-tools repo.
- In the release_tools repo, create a class ReleaseTools::Deployments::BlockersMetrics to emit a delivery_deployment_blocker metric that contains the following information (set the value to 1, so we can sum it on the Grafana side):
  - root_cause_label
  - created_at
  - hours_gstg_blocked
  - hours_gprd_blocked
- Implement a rake task in deployments.rake to create and execute that class
- Call that rake task in automation.gitlab-ci.yml, in record-deployment-blockers, after the blockers_report and blockers_annotate rake tasks
- After the MR is merged, run the deployments:blockers_report scheduled pipeline in ops release/tools. The metric should now appear on Grafana (Explore)
- Use that metric to create a dashboard panel under the Release Management Toil dashboard (IaC under runbooks/dashboard)
  - For the scope of this issue, let's keep to just the flakey tests count/trend per release. Table and/or time series graph?
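As a starting point for discussion, here is a minimal sketch of what the BlockersMetrics class could look like. Only the class name, the metric name and the four fields come from this issue. The Blocker struct, the RecordingClient and its set method are hypothetical stand-ins, since the actual metrics client used in release-tools is not described here, and passing all four fields as labels is one possible reading of the idea above.

```ruby
# Hypothetical stand-in for a labelled release-task issue.
Blocker = Struct.new(:root_cause_label, :created_at,
                     :hours_gstg_blocked, :hours_gprd_blocked,
                     keyword_init: true)

# Hypothetical in-memory client used only for this sketch;
# the real metrics client in release-tools may look different.
class RecordingClient
  attr_reader :samples

  def initialize
    @samples = []
  end

  def set(name, value, labels:)
    @samples << { name: name, value: value, labels: labels }
  end
end

module ReleaseTools
  module Deployments
    class BlockersMetrics
      METRIC = 'delivery_deployment_blocker'

      def initialize(blockers, client:)
        @blockers = blockers
        @client = client
      end

      # Emits one sample per blocker with a constant value of 1,
      # so the dashboard can sum samples per root cause.
      def execute
        @blockers.each do |blocker|
          @client.set(METRIC, 1, labels: labels_for(blocker))
        end
      end

      private

      # Assumption: every field is sent as a string label.
      def labels_for(blocker)
        {
          root_cause_label: blocker.root_cause_label,
          created_at: blocker.created_at,
          hours_gstg_blocked: blocker.hours_gstg_blocked.to_s,
          hours_gprd_blocked: blocker.hours_gprd_blocked.to_s
        }
      end
    end
  end
end

# Made-up sample data, for illustration only.
blockers = [
  Blocker.new(root_cause_label: 'RootCause::Flakey-Tests', created_at: '2024-05-06',
              hours_gstg_blocked: 2, hours_gprd_blocked: 4)
]
client = RecordingClient.new
ReleaseTools::Deployments::BlockersMetrics.new(blockers, client: client).execute
puts client.samples.inspect
```

Injecting the client keeps the class easy to test. The rake task would then only need to fetch the labelled issues and pass in the real client.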
Exit Criteria
- Metrics capturing the various information about deployment blockers are emitted to Grafana
- A panel exists on Grafana that shows the trend of RootCause::Flakey-Tests deployment blocker issues.
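To make the "trend" in the second criterion more concrete, the sketch below counts blockers per ISO week for a single root cause. It only illustrates the aggregation the panel is meant to show. On the dashboard this would be done by querying the metric rather than in Ruby. The weekly_counts helper, the sample dates and the RootCause::Example-Other label are all hypothetical.

```ruby
require 'date'

# Made-up sample data: [created_at, root cause label] pairs.
SAMPLE_BLOCKERS = [
  ['2024-05-06', 'RootCause::Flakey-Tests'],
  ['2024-05-08', 'RootCause::Flakey-Tests'],
  ['2024-05-09', 'RootCause::Example-Other'], # hypothetical label
  ['2024-05-14', 'RootCause::Flakey-Tests']
].freeze

# Returns a hash of ISO week (e.g. "2024-W19") => number of blockers
# carrying the given root cause label.
def weekly_counts(blockers, label)
  blockers
    .select { |_created_at, root_cause| root_cause == label }
    .group_by { |created_at, _root_cause| Date.parse(created_at).strftime('%G-W%V') }
    .transform_values(&:count)
end

weekly_counts(SAMPLE_BLOCKERS, 'RootCause::Flakey-Tests').each do |week, count|
  puts "#{week}: #{count}"
end
```

With this sample data, the first week has two flakey-test blockers and the following week has one. That per-week series is what a time series panel would plot.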