As an executive investing in DevOps, I want to see my ROI. I want to measure and monitor Change failure ratio in the deployment frequency of my dev team. This will most likely drive the KPIs of my team.
Change failure ratio - Deployments with failures divided by total deployments
If we enable users to create alerts for events such as these, it will be easy for us to generate and display these metrics for users. This will also help to drive usage from Release into Incident Management for Monitor :)
@sarahwaldner I think what I will need from your team is a way to get these metrics - ~"group::progressive delivery" will handle the display and aggregation. What say you?
@ogolowinski I like that plan. Let's talk through scope then we can address timing. I see that this is in the %Backlog - is this urgent for you? My team has started working on a new project, so if I were to move something like this into the build phase it would be at least a couple milestones.
So we want to display the Change failure ratio, which is the number of deployment failures that generated an alert or incident divided by the total number of deployments.
To calculate this we need to know:
Number of deployments - We already know this
Number of failures - Is this the number of failed deployments? Do you track this today?
Number of failures that generated an alert - We need to add alerting for pipeline events. This will involve some sort of config page where a user chooses to enable alerts on different types of events (e.g. failed deployment).
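To make the calculation concrete, here is a minimal sketch of the ratio as defined above (deployments with failures divided by total deployments). The `Deployment` record and its `failed` flag are hypothetical names for illustration only, not an existing GitLab model or API.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical record: field names are assumptions for this sketch.
    environment: str
    failed: bool


def change_failure_ratio(deployments):
    """Return failed deployments / total deployments, or None if there are none."""
    total = len(deployments)
    if total == 0:
        # No deployments in the window, so the ratio is undefined.
        return None
    failures = sum(1 for d in deployments if d.failed)
    return failures / total


deployments = [
    Deployment("production", failed=False),
    Deployment("production", failed=True),
    Deployment("staging", failed=False),
    Deployment("production", failed=False),
]
ratio = change_failure_ratio(deployments)
print(ratio)  # 0.25 (1 failure out of 4 deployments)
```

Whether "failed" means a failed deployment job or a deployment that later raised an alert or incident is exactly the open question in this thread; the sketch only fixes the arithmetic.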
Questions for you:
Is it sufficient (or desired) to alert on all deployment failures? Do you think that users would want more granularity in their alerting to do something such as: Trigger alert after pipeline fails 3 times? (I imagine this would only be the case if it is possible to set up deployments to auto-retry. Is that possible?)
I think that users will want to be able to configure alerting on deployments by environment. Do you agree?
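As a rough illustration of the granularity being asked about, the sketch below raises an alert for an environment only after a configurable number of consecutive failed deployments. The event shape, the `threshold` parameter, and the function name are all assumptions made for discussion, not a proposed design.

```python
from collections import defaultdict


def environments_to_alert(events, threshold=3):
    """Return environments that reached `threshold` consecutive failures.

    `events` is a chronological iterable of (environment, succeeded) pairs.
    """
    streak = defaultdict(int)  # consecutive failures per environment
    alerted = []
    for environment, succeeded in events:
        if succeeded:
            # A successful deployment resets the failure streak.
            streak[environment] = 0
            continue
        streak[environment] += 1
        if streak[environment] == threshold:
            # Alert once, at the moment the threshold is reached.
            alerted.append(environment)
    return alerted


events = [
    ("production", False),
    ("staging", False),
    ("production", False),
    ("production", False),  # third consecutive production failure
    ("staging", True),
    ("production", False),  # streak continues, no second alert
]
print(environments_to_alert(events))  # ['production']
```

Setting `threshold=1` corresponds to alerting on every deployment failure, and tracking the streak per environment corresponds to the per-environment configuration suggested above.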
So what you need from my team is data on the time to resolve an incident. I think that should be fairly simple to calculate and add to the incidents table. Let me pull in some of my engineers.
@splattael @seanarnold I am collaborating with the Release team on metrics for release and response teams. We want to know the mean time to resolve incidents. The release team will take care of the aggregation and display of the metrics - we need to provide them with the time to resolve each incident. I think that they should be fairly simple for us to calculate (ended_at - started_at) - would we add this to the issues table?
> I think that they should be fairly simple for us to calculate (ended_at - started_at) - would we add this to the issues table?
Currently, an incident is a special type of an issue so I'd assume we consider a closed incident issue as resolved incident. If so, the mean time to resolve incidents could be calculated by closed_at - created_at (already available; no new columns are needed).
@ogolowinski I am not sure that you need my team to build anything. The data is already captured in the issues table! You can simply query for issues where issue_type=incident and compute closed_at - created_at. Thoughts?
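A minimal sketch of that calculation, assuming each row exposes the `issue_type`, `created_at`, and `closed_at` values mentioned above. The dictionaries stand in for query results and the sample data is made up; this is not the actual query the Release team would run.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(issues):
    """Average closed_at - created_at over closed incidents, or None if there are none."""
    durations = [
        issue["closed_at"] - issue["created_at"]
        for issue in issues
        # Only closed incidents count as resolved.
        if issue["issue_type"] == "incident" and issue["closed_at"] is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


issues = [
    {"issue_type": "incident",
     "created_at": datetime(2020, 6, 1, 9, 0),
     "closed_at": datetime(2020, 6, 1, 11, 0)},   # resolved in 2 hours
    {"issue_type": "incident",
     "created_at": datetime(2020, 6, 2, 9, 0),
     "closed_at": datetime(2020, 6, 2, 13, 0)},   # resolved in 4 hours
    {"issue_type": "issue",
     "created_at": datetime(2020, 6, 3, 9, 0),
     "closed_at": datetime(2020, 6, 3, 10, 0)},   # not an incident, ignored
    {"issue_type": "incident",
     "created_at": datetime(2020, 6, 4, 9, 0),
     "closed_at": None},                          # still open, ignored
]
print(mean_time_to_resolve(issues))  # 3:00:00
```

Open incidents are skipped here because they have no `closed_at` yet; whether a reopened incident should count from its first or last close is left undecided.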
> Number of failures that generated an alert - We need to add alerting for pipeline events. This will involve some sort of config page where a user chooses to enable alerts on different types of events (e.g. failed deployment).
Why: Considering an Ultimate eval with app sec, and the DORA 4 metrics will support their ops visibility. Would like this seamlessly integrated, without custom scripting between multiple systems.
This appears to be a duplicate of [VSA] Add change failure rate (DORA), where that issue is owned by ~"group::optimize" and this one is owned by ~"group::release". It seems like the ~"group::optimize" issue is a bit further along (has a mockup), and VSA is owned by ~"group::optimize", so I'm going to close this one as the duplicate.