Add infrastructure as known transient error (!2184) · Merge requests · GitLab.org / Quality Department / triage-ops

Jennifer Li requested to merge jennli-add-more-transient-error-cases into master Apr 18, 2023

What does this MR do and why?

Add master-broken::infrastructure as a known transient failure so that we can retry the job and automatically close the incident.

I also removed master-broken::dependency-upgrade from the transient error list, since the segfault caused by the last Ruby upgrade should be resolved by gitlab-org/gitlab-build-images!672 (merged). I don't think master-broken::dependency-upgrade should always be treated as a transient error. If we do see more segfaults, they may no longer be caused by the Ruby upgrade, so we should be careful about labeling these errors from this point on.
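
To illustrate the intended behavior, here is a minimal sketch of a label-based transient check. It is not the triage-ops implementation (which is written in Ruby). The names `TRANSIENT_ROOT_CAUSE_LABELS`, `is_known_transient`, and `triage_actions`, the action strings, and the scoped-label spelling of the label names are all assumptions made for illustration.

```python
# Hypothetical sketch: none of these names come from the triage-ops codebase.

# Root-cause labels treated as known transient failures after this MR.
# "master-broken::dependency-upgrade" is intentionally absent, because this MR
# removes it from the list.
TRANSIENT_ROOT_CAUSE_LABELS = frozenset({
    "master-broken::infrastructure",
})


def is_known_transient(incident_labels):
    """Return True if any label on the incident marks a known transient root cause."""
    return any(label in TRANSIENT_ROOT_CAUSE_LABELS for label in incident_labels)


def triage_actions(incident_labels):
    """Return the follow-up actions for an incident, based only on its labels."""
    if is_known_transient(incident_labels):
        # Known transient failure: retry the failed job and close the incident.
        return ["retry_job", "close_incident"]
    # Anything else needs a human to investigate.
    return ["investigate"]
```

Under these assumptions, an incident labeled master-broken::infrastructure would be retried and closed, while one labeled master-broken::dependency-upgrade would now be left open for investigation.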

Expected impact & dry-runs

These are strongly recommended to assist reviewers and reduce the time to merge your change.

See https://gitlab.com/gitlab-org/quality/triage-ops/-/tree/master/doc/scheduled#testing-policies-with-a-dry-run on how to perform dry-runs for new policies.

See https://gitlab.com/gitlab-org/quality/triage-ops/-/blob/master/doc/reactive/best_practices.md#use-the-sandbox-to-test-new-processors on how to make sure a new processor can be tested.

Action items

If adding environment variables for reactive processors, update config/triage-web.yaml and .gitlab/ci/triage-web.yml
(If applicable) Add documentation to the handbook pages for Triage Operations =>
(If applicable) Identify the affected groups and how to communicate to them:
- /cc @person_or_group =>
- Relevant Slack channels =>
- Engineering week-in-review

Edited Apr 18, 2023 by Jennifer Li
