The Auto DevOps pipeline metric is too noisy to be useful
Problem to solve
The Auto DevOps completed pipelines metric (internal link) is too noisy. At current volumes, it can only detect complete failures, but those are very rare.
We also have no baseline for what a normal rate of successes and failures looks like.
Note: The problem extends to any "pipeline template" we have, i.e. templates designed to be used with `include:template` (except there we have no visibility at all).
Background
There are two reasons for the noise:
- user control: Pipelines frequently fail due to user error (e.g. code errors causing builds or deployments to fail).
- low volume: A single user running a lot of pipelines can skew the numbers significantly.
This is a proposed corrective action for gitlab-com/gl-infra/production#4288 (comment 558175736):

> We didn't see anything related to this in our monitoring. Do our current metrics cover this failure?
Ideas
Add a rate-limited metric, limited to the default branch
Add a counter `auto_devops_hourly_pipeline_status_total` with a `status` label set to the `pipeline.status`, and only increment it if the previous Auto DevOps pipeline in the same project was created more than 1 hour ago. Limit this metric to the default branch.
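A minimal sketch of the proposed counter, assuming a Python/Prometheus-style metrics layer (the real implementation would live in GitLab's Ruby codebase). The counter name and `status` label come from the proposal above; `record_auto_devops_pipeline`, the `pipeline`/`project` objects, and `previous_auto_devops_pipeline` are hypothetical stand-ins for the real models.

```python
from datetime import timedelta

from prometheus_client import Counter

# Counter name and `status` label as proposed above.
AUTO_DEVOPS_PIPELINE_STATUS = Counter(
    "auto_devops_hourly_pipeline_status_total",
    "Auto DevOps pipelines on the default branch, rate-limited to one per project per hour",
    ["status"],
)

RATE_LIMIT_WINDOW = timedelta(hours=1)


def record_auto_devops_pipeline(pipeline, project):
    """Increment the counter for a finished Auto DevOps pipeline."""
    # Limit the metric to the default branch: only count errors that
    # affect production deployments.
    if pipeline.ref != project.default_branch:
        return
    # Rate limit: skip counting if the previous Auto DevOps pipeline in
    # the same project was created less than an hour before this one.
    previous = project.previous_auto_devops_pipeline(before=pipeline.created_at)
    if previous is not None and pipeline.created_at - previous.created_at < RATE_LIMIT_WINDOW:
        return
    AUTO_DEVOPS_PIPELINE_STATUS.labels(status=pipeline.status).inc()
```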
Hypothesis: By rate limiting the counting to 1 pipeline per project per hour, we reduce the weight of any single project and make the metric more representative of the system as a whole. And by limiting to the default branch, we only count errors that affect production deployments. This should improve our ability to detect incidents affecting only a portion of users. In particular, it should have detected gitlab-com/gl-infra/production#4288 (closed).
Before starting work: Sample a few hours of Auto DevOps pipelines to validate or reject the hypothesis. To be worthwhile, the metric should clearly detect the incident at gitlab-com/gl-infra/production#4288 (closed).
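One way to do that sampling offline, as a sketch: export the sampled pipelines (plus the incident window) and replay them through the same rules to see what the metric would have recorded. The `simulate_metric` helper and the dict field names are assumptions; only the default-branch and 1-hour rules come from the proposal.

```python
from collections import Counter
from datetime import timedelta

RATE_LIMIT_WINDOW = timedelta(hours=1)


def simulate_metric(pipelines):
    """Replay sampled pipelines (dicts sorted by created_at) through the
    proposed rules and tally what the metric would have recorded."""
    previous_created = {}  # project_id -> created_at of the previous counted-eligible pipeline
    tally = Counter()
    for p in pipelines:
        # Limit to the default branch, as in the proposal.
        if p["ref"] != p["default_branch"]:
            continue
        last = previous_created.get(p["project_id"])
        previous_created[p["project_id"]] = p["created_at"]
        # Rate limit: skip if the previous pipeline in the same project
        # was created less than an hour before this one.
        if last is not None and p["created_at"] - last < RATE_LIMIT_WINDOW:
            continue
        tally[p["status"]] += 1
    return tally
```

Comparing the tallies for a quiet window against the incident window should show whether the failure ratio shifts clearly enough to alert on.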
Generic monitoring for CI/CD templates?
As the problem is not specific to Auto DevOps, but rather applies to all templates designed to be used with `include:template`, a more generic solution could be useful (see the sketch after the questions below).
- Is monitoring the success rate of entire pipelines useful?
- Would job-level monitoring be more useful?
- How can we better deal with user errors?
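As a starting point for that discussion, a hypothetical generalization of the counter above: one metric covering every pipeline template, keyed by template name. The metric name, label set, and helper here are assumptions, not an existing GitLab API.

```python
from prometheus_client import Counter

# Hypothetical generalization: one counter for all pipeline templates,
# distinguished by a `template` label.
CI_TEMPLATE_PIPELINE_STATUS = Counter(
    "ci_template_hourly_pipeline_status_total",
    "Pipelines using an include:template, rate-limited per project and template",
    ["template", "status"],
)


def record_template_pipeline(pipeline, template_name):
    # The same default-branch and 1-hour rate-limit rules as the Auto
    # DevOps proposal would apply here, per (project, template) pair.
    CI_TEMPLATE_PIPELINE_STATUS.labels(
        template=template_name, status=pipeline.status
    ).inc()
```

A `template` label would let the same dashboards and alerts cover Auto DevOps and every other pipeline template, at the cost of higher metric cardinality.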