DORA CFR - improve the metric calculation to effectively track incidents and deployments.
Overview
Change failure rate (CFR) is how often a change causes a failure in production.
In GitLab, change failure rate is measured as the percentage of deployments that cause an incident in production. GitLab calculates this as the number of incidents divided by the number of deployments to a production environment.
Problem
Today, incidents in the "Change failure rate" calculation are not associated with an environment or a deployment at all.
- If you have 10 deployments (1 deployment per day) with 1 incident on the first day and 1 incident on the last day, then your CFR is 0.2.
- If you have 10 deployments and two incidents both happen on the first day (when the first deployment happens), then your CFR is still 0.2. This is wrong: only 1 of the 10 deployments caused a failure, so the correct answer should be 0.1.
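The two cases above can be reproduced with a minimal sketch. `cfr_current` follows the calculation described in the Overview (incidents divided by production deployments). `cfr_linked` is a hypothetical alternative that assumes each incident carries the ID of the deployment that caused it, a link that does not exist today. Returning 0.0 when there are no deployments is also an assumption, not documented behavior.

```python
def cfr_current(incident_count: int, deployment_count: int) -> float:
    """Incidents divided by production deployments (the calculation in the Overview)."""
    if deployment_count == 0:
        return 0.0  # assumption: no deployments means nothing to fail
    return incident_count / deployment_count


def cfr_linked(incident_deployment_ids, deployment_ids) -> float:
    """Share of deployments with at least one linked incident.

    Hypothetical: assumes every incident records the ID of the deployment
    that caused it.
    """
    deployments = set(deployment_ids)
    if not deployments:
        return 0.0
    # Several incidents on the same deployment count as one failed deployment.
    failed = {d for d in incident_deployment_ids if d in deployments}
    return len(failed) / len(deployments)


deployments = list(range(1, 11))  # 10 deployments, one per day
spread = [1, 10]                  # incidents on the first and last deployment
clustered = [1, 1]                # two incidents on the first deployment

print(cfr_current(len(spread), len(deployments)))     # 0.2
print(cfr_current(len(clustered), len(deployments)))  # 0.2
print(cfr_linked(spread, deployments))                # 0.2
print(cfr_linked(clustered, deployments))             # 0.1
```

Counting distinct failed deployments rather than raw incidents is what separates the two cases. The count-based formula cannot tell them apart.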
Also, it's not clear how incident time is tracked in DORA. How many incidents are allowed per deployment?
Related customer feedback: "Basically, the numbers on the DORA metrics would be much more useful if they were qualified with their universe size/how the answer was gotten. In this case, knowing there were 8 incidents and 5 releases so it yields 1.0 would be valuable. Or... what if I have 20 incidents on 1 release and 0 on the next 19 releases - is it still 1.0 even though 19/20 were clean?"
Proposal
- Option 1 - Automatically guessing the deployment for an incident.
- Option 2 - Adding a manual link between an incident and a deployment.
- Option 3 - Adding a link between an incident and an MR.
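As a rough illustration of Option 1, one possible heuristic is to attribute each incident to the most recent production deployment that happened at or before the incident was created. This is only a sketch of one way to guess. The function names, the use of the incident creation time, and the decision to ignore incidents that predate every deployment are all assumptions, not a decided design.

```python
from bisect import bisect_right
from datetime import datetime


def guess_deployment(incident_created_at, deployment_times):
    """Return the index of the latest deployment at or before the incident.

    deployment_times must be sorted in ascending order. Returns None when
    the incident predates every deployment.
    """
    i = bisect_right(deployment_times, incident_created_at)
    return i - 1 if i > 0 else None


def cfr_with_guess(incident_times, deployment_times):
    """Share of deployments that have at least one incident attributed to them."""
    ordered = sorted(deployment_times)
    if not ordered:
        return 0.0  # assumption: no deployments means nothing to fail
    failed = set()
    for created_at in incident_times:
        idx = guess_deployment(created_at, ordered)
        if idx is not None:  # assumption: unattributable incidents are ignored
            failed.add(idx)
    return len(failed) / len(ordered)


# 10 deployments, one per day at 09:00, and two incidents on the first day.
deployments = [datetime(2024, 1, day, 9, 0) for day in range(1, 11)]
incidents = [datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 15, 0)]

print(cfr_with_guess(incidents, deployments))  # 0.1
```

A time-based guess like this needs no manual input, but it can blame the wrong deployment when an incident surfaces late. Options 2 and 3 avoid that by using an explicit link.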