Proposal: add a "master-broken::job performance" incident root cause

I'm seeing an increasing number of incidents caused by spec jobs exceeding the 90-minute threshold, such as https://gitlab.com/gitlab-org/gitlab/-/jobs/4808076620. There could be a number of reasons for this, such as poor test performance or Knapsack failing to distribute specs evenly. I think we would benefit from collecting data on how often these slow jobs contribute to overall pipeline instability.
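
As a rough illustration of the kind of data this would give us, here is a minimal sketch that computes the share of jobs running past the 90-minute threshold. The `Job` record, its field names, and the sample durations are all made up for illustration; they are not taken from real pipeline data or from any existing query.

```python
from dataclasses import dataclass

TIMEOUT_SECONDS = 90 * 60  # the 90-minute threshold mentioned above


@dataclass
class Job:
    # Hypothetical record shape, for illustration only.
    name: str
    duration_seconds: float


def timeout_share(jobs: list[Job]) -> float:
    """Return the fraction of jobs whose duration exceeds the threshold."""
    if not jobs:
        return 0.0
    slow = [job for job in jobs if job.duration_seconds > TIMEOUT_SECONDS]
    return len(slow) / len(jobs)


# Made-up sample data: two of these four jobs exceed 5400 seconds.
sample = [
    Job("rspec unit 1/28", 5700.0),
    Job("rspec unit 2/28", 3100.0),
    Job("rspec system 3/28", 5401.0),
    Job("rspec system 4/28", 2900.0),
]
print(timeout_share(sample))
```

With a dedicated root cause label, the same ratio could be tracked over incidents rather than over individual jobs.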

I have two proposals:

Make it master-broken::job performance
Make it flaky-test::test performance

Which one makes more sense?

@gl-quality/eng-prod I'd like to hear your thoughts please, thanks!

Update: We will proceed with master-broken::job-timeout

Required Actions

Update master broken handbook page
Update broken_master_incidents Sisense query
Add the label to gitlab-org
Update master-broken-incidents insights.yml

Edited Aug 11, 2023 by Jennifer Li