Design dashboards for discovering deployment blockers

Summary

The spike on tracking each environment's ability to receive a deployment produced three metrics.

delivery_auto_deploy_package_state

This metric tracks the state of every auto-deploy package through the following states: missing, pending, building, ready, failed.

delivery_auto_deploy_environment_state

This metric will track the state of each environment, such as ready (the environment can receive a deployment), locked, awaiting_promotion (gstg/gprd can be in this state when a package is ready to be promoted), and baking_time (gprd-cny can be in this state while a package is baking on it).

delivery_auto_deploy_env_lock_state

This metric will track why an environment is in the locked state; the previous metric (delivery_auto_deploy_environment_state) only tracks whether an environment is locked. An environment can be locked due to an ongoing deployment, a post-deploy migration, or QA. In future iterations, we can also track when an environment is locked due to an incident or change request.
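
To make the discussion concrete, here is a minimal sketch of how a state metric like the ones above could be exposed, assuming one series per possible state with a value of 1 for the current state and 0 for the others. The label names (`state`, `target_env`) and the helper `state_samples` are hypothetical and only for illustration; the spike may expose these metrics differently.

```python
# Environment states named in this issue. The label names used below
# ("state", "target_env") are assumptions, not the spike's actual labels.
ENV_STATES = ("ready", "locked", "awaiting_promotion", "baking_time")


def state_samples(metric, labels, states, current):
    """Render one Prometheus text-format sample per possible state.

    The series for the current state gets value 1, all others get 0.
    """
    if current not in states:
        raise ValueError(f"unknown state: {current}")
    lines = []
    for state in states:
        all_labels = {**labels, "state": state}
        rendered = ",".join(
            f'{key}="{val}"' for key, val in sorted(all_labels.items())
        )
        flag = 1 if state == current else 0
        lines.append(f"{metric}{{{rendered}}} {flag}")
    return lines


samples = state_samples(
    "delivery_auto_deploy_environment_state",
    {"target_env": "gstg"},
    ENV_STATES,
    "locked",
)
for line in samples:
    print(line)
```

With this shape, a dashboard panel can select the series whose value is 1 to show the current state of each environment, and the same helper would apply to the package and lock-reason metrics with their own state lists.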

This issue is for discussing what we would like to see in a Grafana dashboard. What will help us discover where to spend our efforts in trying to reduce the time that deployments are blocked?
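
One view that might help answer this is the share of time each environment spends in each non-ready state. The sketch below shows the calculation on a list of state transitions. The function names and the sample timestamps are made up for illustration; in a real dashboard the same aggregation would come from a query over the metrics above.

```python
from collections import defaultdict


def time_in_state(transitions, end):
    """Sum the seconds spent in each state.

    transitions: list of (timestamp_seconds, state), sorted by timestamp.
    end: timestamp at which the observation window closes.
    """
    totals = defaultdict(float)
    boundaries = transitions[1:] + [(end, None)]
    for (start, state), (next_start, _) in zip(transitions, boundaries):
        totals[state] += next_start - start
    return dict(totals)


def blocked_share(totals, ready_state="ready"):
    """Return the fraction of the window spent in each non-ready state,
    largest first."""
    window = sum(totals.values())
    if window == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {
        state: seconds / window
        for state, seconds in ranked
        if state != ready_state
    }


# Illustrative one-hour window for a single environment (made-up values).
transitions = [
    (0, "ready"),
    (600, "locked"),
    (1800, "awaiting_promotion"),
    (2400, "ready"),
]
totals = time_in_state(transitions, end=3600)
shares = blocked_share(totals)
print(shares)
```

Ranking the non-ready states this way points at the largest contributor to blocked time. Running the same calculation on delivery_auto_deploy_env_lock_state would break the locked share down by reason.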

Edited Mar 06, 2024 by Reuben Pereira