Create a metric to measure correlation of Corrective Actions/Infradev Issues to Incidents
As part of our OKR to achieve > 99.95% availability for primary services (excluding git access, CI Runners, and Sidekiq), we would like to create a metric that measures how well we're doing at creating CAs and InfraDev issues for incidents over time. This idea originated in the discussion around creating a monthly process to review incident and alerting trends, but it also adds value toward accomplishing this OKR.
@ahanselka and I discussed this at length and we think the best option is to create a Sisense chart which we can then add to the SaaS Availability Standup as well as the Monthly Trend Review.
Here is a sample chart that demonstrates what we're after:
We'll need to enlist the Engineering Analytics team to help us create the chart in Sisense. This issue tracks the things we need to do to reach the desired result.
- Decide what label we'll use to designate if an incident issue has a CA/InfraDev issue associated with it (I sort of like CA/InfraDev Associated)
- Create an automated process that shows how many incidents are missing a CA/InfraDev issue and list them out in some sort of report, potentially the Weekly Newsletter
- MR to update the Reliability process to indicate what labels need to be applied and when.
- MR to update the incident management and incident review processes to include this new process.
- Create issue on the Engineering Analytics board to create the new Sisense chart.
We can add to the above list if we discover further actions.
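
As a starting point for the automated report in the list above, here is a minimal sketch of how the numbers could be computed. It makes several assumptions: the label ends up being `CA/InfraDev Associated` (not decided yet), and incidents have already been fetched as dictionaries with `iid`, `created_at` (ISO 8601), and `labels` fields. The function name `summarize_incidents` and the report shape are hypothetical, and fetching the issues is left out entirely.

```python
from collections import defaultdict

# Assumption: the label name is still undecided; this uses the suggestion above.
ASSOCIATED_LABEL = "CA/InfraDev Associated"


def summarize_incidents(incidents, label=ASSOCIATED_LABEL):
    """Group incidents by month and report label coverage per month.

    Each incident is assumed to be a dict with "iid", "created_at"
    (an ISO 8601 timestamp string) and "labels" (a list of label names).
    """
    by_month = defaultdict(lambda: {"total": 0, "associated": 0, "missing": []})

    for incident in incidents:
        month = incident["created_at"][:7]  # "YYYY-MM" prefix of the timestamp
        bucket = by_month[month]
        bucket["total"] += 1
        if label in incident["labels"]:
            bucket["associated"] += 1
        else:
            # Keep the IDs so the report can list the incidents to follow up on.
            bucket["missing"].append(incident["iid"])

    report = {}
    for month, bucket in sorted(by_month.items()):
        report[month] = {
            **bucket,
            # Share of the month's incidents that carry the label.
            "coverage": bucket["associated"] / bucket["total"],
        }
    return report


if __name__ == "__main__":
    # Made-up sample data, only to show the expected input shape.
    sample = [
        {"iid": 1, "created_at": "2021-01-05T10:00:00Z",
         "labels": ["incident", ASSOCIATED_LABEL]},
        {"iid": 2, "created_at": "2021-01-20T10:00:00Z", "labels": ["incident"]},
    ]
    for month, stats in summarize_incidents(sample).items():
        print(month, f"{stats['coverage']:.0%}", "missing:", stats["missing"])
```

The per-month `coverage` value is roughly what the Sisense chart would plot, and the `missing` list is what could go into the Weekly Newsletter.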
