PoC: Introduce a processor to create incidents from failing pipelines (!2111) · Merge requests · GitLab.org / Quality Department / triage-ops

Rémy Coutable requested to merge introducde-a-processor-to-create-incident-for-failing-pipeline into master Mar 23, 2023

What does this MR do and why?

This MR currently does 3 things:

Support pipeline events: !2116 (merged)
Introduce classes to create incidents from pipeline events
Introduce a processor to create incidents from failing pipelines
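
As a rough illustration of how these three pieces could fit together, here is a minimal sketch. It is not the code from this MR: `PipelineEvent`, `IncidentCreator`, `FailedPipelineIncidentProcessor`, the injected `client` (assumed to respond to `create_issue(project, title, options)`), and the incident project path are all hypothetical names chosen for the example. The payload keys follow GitLab's pipeline webhook format.

```ruby
# Sketch only: class names and the client interface are hypothetical,
# not the classes introduced by this MR.

# Wraps the parts of a pipeline webhook payload that the processor needs.
PipelineEvent = Struct.new(:project_path, :pipeline_id, :ref, :status, :web_url, keyword_init: true) do
  def self.from_payload(payload)
    attrs = payload.fetch('object_attributes')
    project = payload.fetch('project')

    new(
      project_path: project.fetch('path_with_namespace'),
      pipeline_id: attrs.fetch('id'),
      ref: attrs.fetch('ref'),
      status: attrs.fetch('status'),
      # Assumption: build the pipeline URL from the project URL.
      web_url: "#{project.fetch('web_url')}/-/pipelines/#{attrs.fetch('id')}"
    )
  end

  def failed?
    status == 'failed'
  end
end

# Creates an incident (an issue with issue_type "incident") for a pipeline event.
class IncidentCreator
  def initialize(client:, incident_project:)
    @client = client # assumed to respond to create_issue(project, title, options)
    @incident_project = incident_project
  end

  def create(event)
    title = format('%s pipeline #%s failed on %s', event.project_path, event.pipeline_id, event.ref)
    options = { issue_type: 'incident', description: "Failed pipeline: #{event.web_url}" }

    @client.create_issue(@incident_project, title, options)
  end
end

# Reacts to pipeline events and creates an incident only for failures on monitored refs.
class FailedPipelineIncidentProcessor
  def initialize(incident_creator:, monitored_refs: ['master'])
    @incident_creator = incident_creator
    @monitored_refs = monitored_refs
  end

  # Returns the creator's result, or nil when the event is ignored.
  def process(payload)
    return unless payload['object_kind'] == 'pipeline'

    event = PipelineEvent.from_payload(payload)
    return unless event.failed? && @monitored_refs.include?(event.ref)

    @incident_creator.create(event)
  end
end
```

Injecting the client keeps the processor testable in a sandbox: a fake client that records calls can stand in for the real API.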

The goal is to replace the jobs and scripts from the main project, since "broken master" management shouldn't be the responsibility of the main project; it is an Ops concern instead.

This effectively replaces:

The notify-pipeline-failure job (except for the Slack notification part for now)
The review-deploy-failure-notification job (except for the Slack notification part for now)
The scripts/create-pipeline-failure-incident.rb script

Still to do

Move the first two commits to separate MRs.
Handle the Slack notification after creating the incident
Handle the ruby2 branch use case
Allow posting to Slack without creating an incident (currently done for stable and ruby2 branches)
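
One possible way to express the last two items is a small lookup from the pipeline's ref to the actions to take. This is a sketch under assumptions, not the planned implementation: `actions_for`, the action symbols, and the `STABLE_BRANCH` pattern (which assumes stable branches are named like `15-10-stable-ee`) are hypothetical.

```ruby
# Assumption: stable branches look like "15-10-stable" or "15-10-stable-ee".
STABLE_BRANCH = /\A\d+-\d+-stable(-ee)?\z/

# Returns the list of actions to perform for a failed pipeline on the given ref.
def actions_for(ref)
  case ref
  when 'master'
    # Create the incident first, then notify Slack about it.
    %i[create_incident notify_slack]
  when 'ruby2', STABLE_BRANCH
    # Slack notification only, no incident.
    %i[notify_slack]
  else
    []
  end
end
```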

Expected impact & dry-runs

These are strongly recommended to assist reviewers and reduce the time to merge your change.

See https://gitlab.com/gitlab-org/quality/triage-ops/-/tree/master/doc/scheduled#testing-with-a-dry-run on how to perform dry-runs for new policies.

See https://gitlab.com/gitlab-org/quality/triage-ops/-/blob/master/doc/reactive/best_practices.md#use-the-sandbox-to-test-new-processors on how to make sure a new processor can be tested.

Action items

If adding environment variables for reactive processors, update config/triage-web.yaml and .gitlab/ci/triage-web.yml
(If applicable) Add documentation to the handbook pages for Triage Operations =>
(If applicable) Identify the affected groups and how to communicate to them:
- /cc @person_or_group =>
- Relevant Slack channels =>
- Engineering week-in-review

Edited Mar 27, 2023 by Rémy Coutable
