Proposal: add a "master-broken::job performance" incident root cause
I'm seeing an increasing number of incidents caused by spec jobs exceeding the 90-minute threshold, such as https://gitlab.com/gitlab-org/gitlab/-/jobs/4808076620. There are several possible causes, such as poor test performance or Knapsack failing to distribute specs evenly. I think we would benefit from collecting data on how often these slow jobs contribute to overall pipeline instability.
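To illustrate the kind of data I have in mind, here is a rough sketch of how the share of failures attributable to slow jobs could be computed. The `Job` record, its field names, and the sample values are hypothetical and only for illustration; real data would come from the incident records or job metadata.

```python
from dataclasses import dataclass

# Threshold taken from the description above (90 minutes).
TIMEOUT_MINUTES = 90


@dataclass
class Job:
    # Hypothetical record shape, not an actual GitLab API object.
    name: str
    duration_minutes: float
    status: str  # "success" or "failed"


def timeout_share(jobs):
    """Return the fraction of failed jobs whose duration reached the threshold."""
    failed = [job for job in jobs if job.status == "failed"]
    if not failed:
        return 0.0
    slow = [job for job in failed if job.duration_minutes >= TIMEOUT_MINUTES]
    return len(slow) / len(failed)


# Made-up sample data: two failures, one of which hit the threshold.
sample = [
    Job("rspec unit 1/28", 90.0, "failed"),
    Job("rspec unit 2/28", 42.5, "failed"),
    Job("rspec system 1/28", 88.0, "success"),
]

print(f"Share of failed jobs at or over {TIMEOUT_MINUTES} min: {timeout_share(sample):.0%}")
```

A dedicated label would let us get the same ratio directly from incident counts instead of inferring it from job durations.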
I have two proposals:
- Make it master-broken::job performance
- Make it flaky-test::test performance
Which one makes more sense?
@gl-quality/eng-prod I'd like to hear your thoughts please, thanks!
Update: We will proceed with master-broken::job-timeout
Required Actions
- Update master broken handbook page
- Update broken_master_incidents Sisense query
- Add label to gitlab.org
- Update master-broken-incidents insights.yml
Edited by Jennifer Li