Present Change failure ratio in value stream analytics

changed milestone to %Backlog

added ~"Category:Continuous Delivery" ~"GitLab Ultimate" ~analytics ~"devops::release [DEPRECATED]" ~direction ~"type::feature" labels

@sarahwaldner I think this could be really interesting, and it leverages alerts and incidents.

If we enable users to create alerts for events such as these, it will be easy for us to generate and display these metrics for users. This will also help to drive usage from Release into Incident Management for Monitor :)

@sarahwaldner I think what I will need from your team is a way to get these metrics - ~"group::progressive delivery" will handle the display and aggregation. What say you?

@ogolowinski I like that plan. Let's talk through scope then we can address timing. I see that this is in the %Backlog - is this urgent for you? My team has started working on a new project, so if I were to move something like this into the build phase it would be at least a couple milestones.

So we want to display the Change failure ratio, which is the number of failures that generated an alert or incident over the number of deployments.

To calculate this we need to know:

Number of deployments - We already know this
Number of failures - Is this the number of failed deployments? Do you track this today?
Number of failures that generated an alert - We need to add alerting for pipeline events. This will involve some sort of config page where a user chooses to enable alerts on different types of events (e.g. failed deployment).
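
The three inputs above can be sketched as a small calculation. This is only an illustration, assuming the ratio is read the usual DORA way (deployments whose failure raised an alert or incident, divided by all deployments). The `Deployment` record and its `raised_alert` field are hypothetical and do not reflect the actual GitLab schema.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Deployment:
    # Hypothetical record; field names are assumptions for illustration only.
    id: int
    status: str                 # "success" or "failed"
    raised_alert: bool = False  # True if the failure generated an alert or incident


def change_failure_ratio(deployments: Sequence[Deployment]) -> Optional[float]:
    """Return failures-with-alert / total deployments, or None if there are no deployments."""
    total = len(deployments)
    if total == 0:
        return None
    # Count only failed deployments that actually produced an alert or incident.
    failures = sum(1 for d in deployments if d.status == "failed" and d.raised_alert)
    return failures / total


sample = [
    Deployment(1, "success"),
    Deployment(2, "failed", raised_alert=True),
    Deployment(3, "failed"),   # failed, but no alert was configured
    Deployment(4, "success"),
]
ratio = change_failure_ratio(sample)
```

Under this reading, a failed deployment with no alert configured does not count toward the ratio, which is why alerting on pipeline events is a prerequisite.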

Questions for you:

Is it sufficient (or desired) to alert on all deployment failures? Do you think that users would want more granularity in their alerting to do something such as: Trigger alert after pipeline fails 3 times? (I imagine this would only be the case if it is possible to set up deployments to auto-retry. Is that possible?)
I think that users will want to be able to configure alerting on deployments by environment. Do you agree?

Thanks @ogolowinski !

@sarahwaldner we already know the number of deployments including how many failed/succeeded (we don't display it though)

I was thinking this issue should be Number of failures that generated an alert/incident and in a next iteration add

Mean time to resolve incidents - Mean time to resolve incidents over the period of time selected in the filter.

@ogolowinski Great!

So what you need from my team is data on the time to resolve an incident. I think that should be fairly simple to calculate and add to the incidents table. Let me pull in some of my engineers.

@splattael @seanarnold I am collaborating with the Release team on metrics for release and response teams. We want to know the mean time to resolve incidents. The release team will take care of the aggregation and display of the metrics - we need to provide them with the time to resolve each incident. I think that they should be fairly simple for us to calculate (ended_at - started_at) - would we add this to the issues table?

@sarahwaldner

I think that they should be fairly simple for us to calculate (ended_at - started_at) - would we add this to the issues table?

Currently, an incident is a special type of issue, so I'd assume we consider a closed incident issue to be a resolved incident. If so, the mean time to resolve incidents could be calculated as closed_at - created_at (already available; no new columns are needed).

Thank you @splattael !

@ogolowinski I am not sure that you need my team to build anything. The data is already captured in the issues table! You can simply query for issues where issue_type=incident and subtract created_at from closed_at. Thoughts?
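
A rough sketch of that calculation, assuming each row from the issues table is available as a dict with `issue_type`, `created_at`, and `closed_at` keys (the dict shape and the function name are assumptions for illustration, not an existing GitLab API):

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def mean_time_to_resolve(issues: Iterable[dict]) -> Optional[timedelta]:
    """Average closed_at - created_at over closed incidents; None if there are none."""
    durations = [
        issue["closed_at"] - issue["created_at"]
        for issue in issues
        # Only closed incidents count as resolved; open ones have no closed_at yet.
        if issue["issue_type"] == "incident" and issue["closed_at"] is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


rows = [
    {"issue_type": "incident", "created_at": datetime(2020, 9, 1, 10), "closed_at": datetime(2020, 9, 1, 12)},
    {"issue_type": "incident", "created_at": datetime(2020, 9, 2, 8), "closed_at": datetime(2020, 9, 2, 12)},
    {"issue_type": "incident", "created_at": datetime(2020, 9, 3, 8), "closed_at": None},
    {"issue_type": "issue", "created_at": datetime(2020, 9, 1, 0), "closed_at": datetime(2020, 9, 5, 0)},
]
mttr = mean_time_to_resolve(rows)
```

Restricting `rows` to the period selected in the filter before calling the function would give the per-period figure discussed above.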

@sarahwaldner awesome - so we are only left with

Number of failures that generated an alert - We need to add alerting for pipeline events. This will involve some sort of config page where a user chooses to enable alerts on different types of events (e.g. failed deployment).

@ogolowinski Oh yes, we clearly need that. Let's collaborate on that here => #217770 (closed)

I've added THIS issue as blocked by #217770 (closed)

@sarahwaldner I opened an issue for mean time to resolve incidents #254193 (closed)

@ogolowinski Thank you!

changed the description

marked this issue as related to #37139 (closed)

Another interesting metric to collect is

Mean time to resolve incidents - Mean time to resolve incidents over the period of time selected in the filter.

added to epic &4358 (closed)

added [deprecated] Accepting merge requests label

Setting label(s) ~"group::progressive delivery" ~"section::ops" based on ~"Category:Continuous Delivery" ~"group::progressive delivery".

added ~"group::release [DEPRECATED]" ~"section::ops" labels

mentioned in epic &4358 (closed)

marked this issue as related to #217770 (closed)

mentioned in issue #217770 (closed)

Customer interest

Enterprise customer expressed interest in this: https://gitlab.my.salesforce.com/0016100001TzOQc (internal link)
Why interested: at the moment the concept of rollbacks is not strong in GitLab, and the customer is not able to collect any information about failed deployments
Current solution: build a custom solution or rely on qualitative feedback from engineers
Nice to have

thanks @dzalbo - as we start to have prototypes for these features, would this customer be interested in connecting?

@jreporter this is the same customer as in the Draft Release request, so we could try to combine these conversations. Happy to arrange something.

Oh awesome, that would be great @dzalbo

added customer label

Users want to count how many times they roll back from production

@ogolowinski - I am curious if you think this could be managed by #6187?

@jreporter We could probably leverage #6187 in order to display this or send via API - yes

thanks @ogolowinski

marked this issue as related to #280551

marked this issue as related to #210323 (closed)

mentioned in epic &5300 (closed)

mentioned in issue #210323 (closed)

changed the description

Customer interested in this: https://gitlab.my.salesforce.com/0064M00000YOZBw
Why: Considering ultimate eval with app sec, and DORA 4 metrics will support their ops visibility. Would like this seamlessly integrated without custom scripting between multiple systems

I am an Ultimate customer who is very very interested in this.

This appears to be a duplicate of [VSA] Add change failure rate (DORA). That issue is owned by ~"group::optimize" and this one is owned by ~"group::release". The ~"group::optimize" issue is a bit further along (it has a mockup), and VSA is owned by ~"group::optimize", so I'm going to close this one as the duplicate.

closed

Present Change failure ratio in value stream analytics

Release notes

Problem to solve

Intended users

User experience goal

Proposal

Further details

Permissions and Security

Documentation

Availability & Testing

What does success look like, and how can we measure that?

What is the type of buyer?

Is this a cross-stage feature?

Links / references

Designs

