Understand why scheduled master pipeline success rates are very low

Since we didn't have a pipeline stability dashboard on Snowflake that we could play around with, I decided to create one: https://app.snowflake.com/ys68254/gitlab/#/pipeline-successful-rates-d4GqRCjeZ

Insert the issue where I talked about why I think tracking retries might reflect productivity better:

A few conclusions I draw from this dashboard:

Looking at all master pipelines, the success rates are good (95% ~ 98%), and the NOT retried rates are a bit lower but close enough (94% ~ 95%).
It's a very different story for scheduled master pipelines, though:
- Nightly: the success rate varies a lot, from 15% to 50%, and has mostly been around 25% recently. We can safely say that it's extremely unstable.
- 2-hourly: this one is very surprising. The success rate is also very bad, at roughly 50%. However, if we look at retries, starting from June we retried basically ALL of the pipelines, and even then we could only keep the success rate at 50%.
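
To make the two rates above concrete, here is a minimal sketch of how they could be computed. It is not the dashboard's actual query: the `Pipeline` record, its field names, and the reading of "NOT retried" as "no job in the pipeline was retried" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    # Hypothetical record shape; the real Snowflake tables may differ.
    schedule: str   # e.g. "nightly" or "2-hourly" (assumed labels)
    status: str     # final status, "success" or "failed"
    retried: bool   # assumed meaning: at least one job was retried


def pipeline_rates(pipelines):
    """Return the success rate and NOT retried rate as fractions of all pipelines."""
    total = len(pipelines)
    if total == 0:
        return None  # no pipelines, so the rates are undefined
    successful = sum(p.status == "success" for p in pipelines)
    not_retried = sum(not p.retried for p in pipelines)
    return {
        "success_rate": successful / total,
        "not_retried_rate": not_retried / total,
    }


# Made-up sample that mimics the 2-hourly pattern: every pipeline was
# retried, yet only half of them ended up successful.
sample = [
    Pipeline("2-hourly", "success", True),
    Pipeline("2-hourly", "failed", True),
    Pipeline("2-hourly", "success", True),
    Pipeline("2-hourly", "failed", True),
]
print(pipeline_rates(sample))
```

With this definition, a group that is retried every time has a NOT retried rate of 0% no matter how many of its pipelines eventually pass, which is why the two rates can tell different stories.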

Why? Let's find the answers to the following questions:

Why are scheduled pipeline success rates so low? Nightly is about 25% and 2-hourly is about 50%.
What's the impact of this disparity?
Did we really retry ALL scheduled 2-hourly pipelines in June and July?
Why are we retrying so hard while the rates stay quite low? Is retrying helpful, or is it just giving us an illusion?
How can we measure how pipeline stability affects productivity in people's daily merge requests?
- There are a lot of reasons a merge request pipeline can fail; for example, it can be a legitimate bug. Here are the charts for merge requests as a reference: To reiterate, I think looking at how often merge request pipelines are retried can roughly reflect the productivity loss from pipeline instability and people's lack of confidence in the pipelines.

Edited Aug 02, 2024 by Lin Jen-Shin