[spike] experiment with tracking environments' ability to receive a deployment
This issue is extracted from #19896 (comment 1689037581). We would like to experiment with tracking some of the proposed information in delivery-metrics and see how we could extract data from a dashboard.
Because the primary purpose is testing the visualization and data extraction, we may cut the scope by skipping the state machine transition rules and instead gathering as much state data as possible for a few days or weeks.
Tracking environments' ability to deploy
I was thinking of a way to model the state of each environment by its ability to receive a deployment. In an ideal world, in Staging Canary, we can imagine a simple loop like the following one.
Each state has a clear owner who can be held accountable for the time spent in that state. For example, if there is "no package to deploy", the Deployment System team should consider improving the package tagging time (or making it reactive). If we are spending too much time in "package building", we can ask Distribution whether it is possible to speed up build times, and so on.
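As a minimal sketch, the ideal loop for Staging Canary could be modeled as a small enum that cycles through its states. Note that only "no package to deploy" and "package building" are named above; the other state names here are illustrative assumptions:

```python
from enum import Enum

class EnvState(Enum):
    # "no package to deploy" and "package building" come from the description
    # above; READY_TO_DEPLOY and DEPLOYING are assumed for illustration.
    AWAITING_PACKAGE = "no package to deploy"
    PACKAGE_BUILDING = "package building"
    READY_TO_DEPLOY = "ready to deploy"
    DEPLOYING = "deploying"

# The ideal case: each state hands off to the next, cycling back to
# waiting for a package once a deployment finishes.
LOOP = [
    EnvState.AWAITING_PACKAGE,
    EnvState.PACKAGE_BUILDING,
    EnvState.READY_TO_DEPLOY,
    EnvState.DEPLOYING,
]

def next_state(state: EnvState) -> EnvState:
    """Return the next state in the ideal (unblocked) loop."""
    return LOOP[(LOOP.index(state) + 1) % len(LOOP)]
```

This ignores blocking states entirely; it only captures the happy path the spike would observe against real transitions.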
As soon as we introduce another environment, Production Canary, we have another state, "Waiting QA", for packages running QA in the previous environment.
Adding Production Main will also introduce the "Baking Time" and "Waiting Promotion" states.
Each of the above states has a clear DRI (internal or external to the team) that we can interface with to improve the situation if needed.
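One way to picture how the state set grows with each environment is a simple mapping. The environment-specific states follow the text above; the base loop entries and the exact labels are illustrative assumptions:

```python
# Base loop states (partly assumed, see the description above).
BASE_LOOP = ["no package to deploy", "package building", "ready to deploy", "deploying"]

# Extra states each environment introduces, per the description.
# Whether Production Main also observes "Waiting QA" is an assumption here.
EXTRA_STATES = {
    "Staging Canary": [],
    "Production Canary": ["Waiting QA"],
    "Production Main": ["Waiting QA", "Baking Time", "Waiting Promotion"],
}

def states_for(environment: str) -> list[str]:
    """All trackable states for an environment: the base loop plus extras."""
    return BASE_LOOP + EXTRA_STATES[environment]
```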
The following table shows an example of how the state of one environment affects the next one, in the simplified use case of producing a single package.
On top of the states described above, we will have several blocking states that identify an inability to deploy, for example:
- Environment locked by Change Request
- Environment locked by Incident
- Environment locked by Post Deployment Migration
- QA Failure, Package not ready
- Blocked, waiting for a specific fix (e.g. a pick into auto-deploy)
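These blocking states only become actionable if we record when each one starts and ends, so that blocked time can be attributed to its DRI. A minimal sketch of such a record, assuming delivery-metrics stores one row per state interval (all field names and labels here are hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The blocking states listed above, as machine-friendly labels
# (the exact label format is an assumption).
BLOCKING_STATES = {
    "locked:change-request",
    "locked:incident",
    "locked:post-deployment-migration",
    "qa-failure",
    "waiting-for-fix",
}

@dataclass
class StateInterval:
    """One stay of an environment in one state."""
    environment: str
    state: str
    started_at: datetime
    ended_at: datetime

    @property
    def duration(self) -> timedelta:
        return self.ended_at - self.started_at

    @property
    def blocking(self) -> bool:
        """True when this interval counts as an external blocker."""
        return self.state in BLOCKING_STATES
```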
If we could implement a metric that keeps track of all the above states, we would have a detailed breakdown of external blockers and the execution time of regular steps. That information could easily contribute to tracking things like:
- Deployment Safety (how fast we can recover from a change failure)
- Hours of deployment disruption
- Deployment efficiency (do we need to separate this from packaging here to get a number?)
- Release Manager Toil
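A rough sketch of how such a metric could roll up into the numbers above, assuming intervals are available as `(state, hours)` pairs. The blocked total is one possible proxy for "hours of deployment disruption"; all names and the sample data are illustrative:

```python
from collections import defaultdict

def breakdown(intervals, blocking_states):
    """Sum hours per state, plus the total hours spent in blocking states
    (a possible proxy for 'hours of deployment disruption')."""
    per_state = defaultdict(float)
    for state, hours in intervals:
        per_state[state] += hours
    disruption = sum(h for s, h in per_state.items() if s in blocking_states)
    return dict(per_state), disruption

# Illustrative data only.
sample = [("deploying", 2.0), ("locked:incident", 3.0), ("locked:incident", 1.0)]
per_state, disruption = breakdown(sample, {"locked:incident"})
```

Per-state totals like these would give the detailed breakdown of external blockers versus the execution time of regular steps mentioned above.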