DevOps Dashboard

Problem to solve

Enterprises need a dashboard to quickly check the health and progress of their DevOps transformation.

Intended users

Further details

After talking with some enterprises at DOES London, I noticed a trend: a few companies use a 4-metric dashboard derived from the book Accelerate. The four metrics are:

Lead time for changes
Deployment frequency
Time to restore service (or MTTR)
Change failure rate

While these metrics don't cover everything enterprises care about, there's value in providing a solid best practice by default and building on an external movement. Or, as https://stelligent.com/2018/12/21/measuring-devops-success-with-four-key-metrics/ says, "it’s extremely beneficial to finally have a canonical source for the relevant metrics that matter to organizations that is backed by data and analysis."

There's some ambiguity in the terms above. "Lead time" is sometimes defined as the wait time before work starts, not including the actual work time. I think that in this case lead time is closer to our definition of cycle time, meaning the time it takes for a change to be made available.

Even then, it's still ambiguous whether it includes decision time, when a change is just an idea and nobody has started working on it yet, or only the time from starting work to shipping it. I think it refers to the latter, even though in reality the time spent debating whether to attempt a change does contribute to longer cycle times.

To be clear(er), there are (at least) three different measures of time:

Time from creation of issue to deployment of MR to production.
Time from approving or scheduling an issue to deployment of MR to production.
Time from creation of MR to deployment of MR to production.

I think, but still need to validate, that the book means the third option.
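To make the difference between the three measures concrete, here is a minimal Python sketch. The `ChangeTimeline` class, its field names, and the example timestamps are all hypothetical and only for illustration; they don't correspond to any existing GitLab API or schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ChangeTimeline:
    """Hypothetical timestamps for a single change (names are illustrative)."""
    issue_created_at: datetime
    issue_scheduled_at: datetime
    mr_created_at: datetime
    deployed_at: datetime  # when the MR reached production


def lead_time_measures(t: ChangeTimeline) -> dict:
    """Return the three candidate measures as timedeltas."""
    return {
        "issue_created_to_deploy": t.deployed_at - t.issue_created_at,
        "issue_scheduled_to_deploy": t.deployed_at - t.issue_scheduled_at,
        "mr_created_to_deploy": t.deployed_at - t.mr_created_at,
    }


# Made-up example: the same change looks very different per definition.
example = ChangeTimeline(
    issue_created_at=datetime(2019, 9, 1, 9, 0),
    issue_scheduled_at=datetime(2019, 9, 3, 9, 0),
    mr_created_at=datetime(2019, 9, 4, 9, 0),
    deployed_at=datetime(2019, 9, 5, 9, 0),
)
measures = lead_time_measures(example)
for name, value in measures.items():
    print(f"{name}: {value}")
```

The same change reports 4 days, 2 days, or 1 day depending on which definition we pick, which is why the definition needs to be settled before building the widget.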

For MTTR, I'm not sure how GitLab can measure this. How do we know there's a failure, and how do we know it's recovered? It should get easier once we have Incident Management. In the meantime, we could possibly use Monitoring alerts and measure how long each alert lasts, but that may not account for false alarms.
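As a rough sketch of the alert-based approach, assuming each alert gives us a fired time and a resolved time: the function name, the tuple shape, and the `min_duration` cutoff are all assumptions for illustration. The cutoff is only a crude stand-in for false-alarm handling, not a real solution.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_restore(alerts, min_duration=timedelta(0)):
    """Average alert duration.

    alerts: list of (fired_at, resolved_at) datetime tuples (hypothetical shape).
    min_duration: alerts shorter than this are skipped, as a crude
    (assumed) way to ignore likely false alarms.
    Returns a timedelta, or None if no alert qualifies.
    """
    durations = [
        (resolved - fired).total_seconds()
        for fired, resolved in alerts
        if resolved - fired >= min_duration
    ]
    if not durations:
        return None
    return timedelta(seconds=mean(durations))


# Made-up alerts: 30 minutes, 90 minutes, and a 1-minute blip.
alerts = [
    (datetime(2019, 9, 1, 10, 0), datetime(2019, 9, 1, 10, 30)),
    (datetime(2019, 9, 2, 14, 0), datetime(2019, 9, 2, 15, 30)),
    (datetime(2019, 9, 3, 8, 0), datetime(2019, 9, 3, 8, 1)),
]
mttr_all = mean_time_to_restore(alerts)
mttr_filtered = mean_time_to_restore(alerts, min_duration=timedelta(minutes=5))
```

With these numbers the unfiltered average is 40 minutes 20 seconds, while dropping the 1-minute blip gives 1 hour, so short false alarms can noticeably skew the metric.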

There's a similar problem for change failure rate. We should be well suited to measure this, but I'm not sure we expose/capture the right information today.
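The calculation itself is simple once we know which deployments caused a failure; the hard part is capturing that flag. A minimal sketch, assuming a hypothetical `caused_failure` boolean on each deployment record (nothing like this is confirmed to exist today):

```python
def change_failure_rate(deployments):
    """Fraction of deployments that caused a failure.

    deployments: list of dicts with a hypothetical "caused_failure" flag.
    Returns a float between 0 and 1, or None when there are no deployments.
    """
    if not deployments:
        return None
    failed = sum(1 for d in deployments if d["caused_failure"])
    return failed / len(deployments)


# Made-up data: one of four deployments caused a failure.
deployments = [
    {"id": 1, "caused_failure": False},
    {"id": 2, "caused_failure": True},
    {"id": 3, "caused_failure": False},
    {"id": 4, "caused_failure": False},
]
rate = change_failure_rate(deployments)
print(f"Change failure rate: {rate:.0%}")
```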

Further iterations may include being able to drill down into each metric to understand root causes and opportunities for improvement. For example, when you drill into lead time, you get into value stream management analytics that identify wait time vs work time.

Proposal

Create a DevOps Dashboard page in the Analytics workspace with the 4 metrics in a form similar to the security dashboard here:

NB: These should be just in the form of widgets so that users can create as many as they want (they likely want to see the mean/median for both cycle time and lead time).
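A quick illustration of why users may want both mean and median widgets: a single slow change can pull the mean far away from the typical value. The lead times below are made-up numbers, in hours.

```python
from statistics import mean, median

# Hypothetical lead times in hours; one change took almost two days.
lead_times_hours = [2, 3, 4, 5, 46]

mean_hours = mean(lead_times_hours)
median_hours = median(lead_times_hours)

print(f"mean: {mean_hours} h, median: {median_hours} h")
```

Here the mean is 12 hours while the median is 4 hours, so the two widgets tell different stories about the same data.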

Permissions and Security

Documentation

Testing

What does success look like, and how can we measure that?

What is the type of buyer?

The person driving the DevOps transformation of a company, usually an executive.

Links / references

Edited Sep 05, 2019 by Virjinia Alexieva