PoC: Introduce a processor to create incidents from failing pipelines
What does this MR do and why?
This MR currently does 3 things:
- Support pipeline events: !2116 (merged)
- Introduce classes to create incidents from pipeline events
- Introduce a processor to create incidents from failing pipelines
The goal is to replace the jobs and scripts in the main project, since "broken master" management shouldn't be the responsibility of the main project, but an Ops concern instead.
This effectively replaces:
- The notify-pipeline-failure job (except for the Slack notification part for now)
- The review-deploy-failure-notification job (except for the Slack notification part for now)
- The scripts/create-pipeline-failure-incident.rb script
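To make the intended flow concrete, here is a minimal sketch of what such a processor could look like. It is not the code from this MR: PipelineEvent, PipelineFailureIncidentProcessor, INCIDENT_REFS and the injected incident_creator callable are all hypothetical names, and the rule "only failed pipelines on master open an incident" is an assumption made for illustration.

```ruby
# Hypothetical sketch: none of these names come from the triage-ops code base.
# A pipeline event is reduced to the few fields the decision needs.
PipelineEvent = Struct.new(:project_path, :ref, :status, :web_url, keyword_init: true)

class PipelineFailureIncidentProcessor
  # Assumption: only failures on these refs should open an incident.
  INCIDENT_REFS = %w[master].freeze

  # incident_creator is any callable accepting title: and description:
  # keywords; injecting it keeps the decision logic free of API calls.
  def initialize(incident_creator)
    @incident_creator = incident_creator
  end

  # True when the event is a failed pipeline on a tracked ref.
  def applicable?(event)
    event.status == 'failed' && INCIDENT_REFS.include?(event.ref)
  end

  # Creates an incident for applicable events and does nothing otherwise.
  def process(event)
    return unless applicable?(event)

    @incident_creator.call(
      title: "#{event.project_path}: pipeline failed on #{event.ref}",
      description: "Failing pipeline: #{event.web_url}"
    )
  end
end
```

Passing a lambda that records its arguments as the incident_creator makes it possible to exercise the decision logic without creating real incidents.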
Still to do
- Move the first two commits to separate MRs.
- Handle the Slack notification after creating the incident
- Handle the ruby2 branch use-case
- Allow posting to Slack without creating an incident (currently done for stable and ruby2 branches)
Expected impact & dry-runs
These are strongly recommended to assist reviewers and reduce the time to merge your change.
See https://gitlab.com/gitlab-org/quality/triage-ops/-/tree/master/doc/scheduled#testing-with-a-dry-run on how to perform dry-runs for new policies.
See https://gitlab.com/gitlab-org/quality/triage-ops/-/blob/master/doc/reactive/best_practices.md#use-the-sandbox-to-test-new-processors on how to make sure a new processor can be tested.
Action items
- If adding environment variables for reactive processors, update config/triage-web.yaml and .gitlab/ci/triage-web.yml
- (If applicable) Add documentation to the handbook pages for Triage Operations =>
- (If applicable) Identify the affected groups and how to communicate to them:
  - /cc @ person_or_group =>
  - Relevant Slack channels =>
  - Engineering week-in-review
Edited by Rémy Coutable