Add the PushFailureCategoryForJobs processor (sandbox-only)

Problem Statement

In Map RSpec CI jobs to a failure category (gitlab-org/gitlab!187501 - merged), we implemented mappings from CI jobs to failure categories directly in gitlab-org/gitlab.

As I discovered in Data quality - Fix blind spots when we cannot r... (gitlab-org/quality/analytics/team#103 - closed), the approach has several drawbacks:

  1. Not all jobs are using Ruby
  2. Some jobs fail before the after_script is run (e.g. git clone issue in the CI job), or after (e.g. artifacts upload issue, job timeout, ...)
  3. The logic to download the trace is complicated, because at the time when we execute the after_script, the job trace is sometimes not fully available (to counteract this, we are retrying several times to download the trace, up until we see a known "marker" in the trace)
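The retry-until-marker logic from point 3 can be sketched as follows. This is a minimal illustration, not the real implementation: `TRACE_END_MARKER`, `fetch_full_trace`, and the attempt/sleep values are all assumed names and numbers.

```ruby
# Illustrative retry loop: poll for the job trace until a known
# end-of-trace marker shows up, or give up after a few attempts.
TRACE_END_MARKER = "section_end:job".freeze
MAX_TRACE_ATTEMPTS = 5

def fetch_full_trace(max_attempts: MAX_TRACE_ATTEMPTS)
  max_attempts.times do
    trace = yield # the caller downloads the current trace snapshot
    return trace if trace&.include?(TRACE_END_MARKER)

    sleep 1 # the trace may not be fully flushed yet; try again shortly
  end

  nil # the marker never appeared: the trace is considered incomplete
end
```

This is exactly the complexity the webhook approach below removes: by the time the pipeline finishes, most traces are already complete.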

This MR proposes to rely on GitLab pipeline webhooks instead, in order to:

  1. Analyze more failed CI jobs, regardless of the tech stack they have available when running
  2. Analyze all failures, regardless of when they happen in the job's lifecycle (before, during, or after the after_script)
  3. Since the webhook is triggered once the pipeline has finished, most of its jobs will have finished by then, so the likelihood of having full job traces increases greatly (and if that's still not enough time in some cases, we could introduce background jobs with a delay of a few minutes)
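The shape of such a webhook-driven processor could look like this. It is a skeleton sketch only: the `object_kind`/`status` accessors on the event are assumptions about the payload, not the exact triage-ops framework API, and the real mapping logic is elided.

```ruby
# Skeleton of a reactive processor for pipeline webhook events.
# The real processor lives in triage-ops; this only illustrates the flow.
module Triage
  class PushFailureCategoryForJobs
    def initialize(event)
      @event = event
    end

    private

    # Only react to pipelines that actually finished in a failed state.
    def applicable?
      @event.object_kind == 'pipeline' && @event.status == 'failed'
    end

    # Map each failed job in the pipeline to a failure category
    # (mapping details elided here).
    def process
      {}
    end
  end
end
```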

Contributes to Data quality - Fix blind spots when we cannot r... (gitlab-org/quality/analytics/team#103 - closed)

Technical considerations

I was initially reluctant to use triage-ops for a few reasons:

  1. The number of pipelines might be too high
  2. The number of failed jobs might be too high
  3. triage-ops might not be stable enough

For 1., we won't have more than 20k failed pipelines per month, which is a completely acceptable number in terms of load (note that the webhook events are already reaching triage-ops today):

[Screenshot: dashboard showing monthly counts of failed pipelines]

See the SQL query for the dashboard above:

```sql
SELECT
    DATE_TRUNC('month', finished_at)::date AS pipeline_finished_at_month,
    COUNT(*)
FROM dim_ci_pipeline
WHERE dim_project_id IN (278964, 13083)
  AND status = 'failed'
GROUP BY 1
```

For 2., I limited the number of failed jobs we'll consider to 20 per failed pipeline. I gave my reasoning in this comment. Once we have reviewed how long the processor takes on average and how many jobs end up unmapped, we'll be able to adjust that value 👍
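The cap itself is a one-liner. A hypothetical sketch (the constant name and the take-the-first-N ordering are assumptions):

```ruby
# Cap the number of failed jobs analyzed per pipeline, to bound the
# work the processor does for very red pipelines.
MAX_FAILED_JOBS_PER_PIPELINE = 20

def jobs_to_analyze(failed_jobs)
  failed_jobs.first(MAX_FAILED_JOBS_PER_PIPELINE)
end
```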

For 3., things have greatly improved:

  1. Webhooks seem to be re-enabled automatically after a failure (that's what I witnessed: triage-ops went down and came back up in ~3 minutes without any intervention)
  2. We have monitoring in place to quickly notice when things are not behaving as we expect them to
  3. Those failure categories are not mission-critical: we can afford to not have them for a day or two if an outage were to happen.

What does this MR do?

  1. Moves the failure category mappings logic to triage-ops.
  2. Greatly simplifies the logic as well: some requirements are relaxed inside triage-ops.
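As an illustration of what a simplified trace-to-category mapping can look like: the patterns below are invented for the example, and only the category names come from the processor output shown later in this MR.

```ruby
# Hypothetical pattern table: the first matching regex wins.
FAILURE_CATEGORIES = {
  /job's log exceeded limit|log too big/i => 'logs_too_big_to_analyze',
  /rspec.*\d+ failures?/im                => 'rspec_valid_rspec_errors_or_flaky_tests'
}.freeze

def failure_category_for(trace)
  FAILURE_CATEGORIES.find { |pattern, _category| trace.match?(pattern) }&.last
end
```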

Deployment

This processor is currently in sandbox mode, meaning it will react to failed pipelines in the https://gitlab.com/gitlab-org/quality/engineering-productivity/triage-ops-playground project.
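A sandbox-only guard can be as simple as comparing the event's project path. This is a hypothetical sketch; the real sandbox routing is handled by triage-ops configuration.

```ruby
# Restrict the processor to the sandbox project while testing.
SANDBOX_PROJECT_PATH =
  'gitlab-org/quality/engineering-productivity/triage-ops-playground'.freeze

def sandbox_event?(project_path)
  project_path == SANDBOX_PROJECT_PATH
end
```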

Once I've tested it there, I'll create another MR to enable it on gitlab-org/gitlab and gitlab-org/gitlab-foss.

Proof of work

Locally

Disclaimer: I don't want to share the webhook payloads, as they sometimes contain confidential information.

I fetched some pipeline webhook events from GCP logs (see query). The process to have them on your local environment is explained in https://gitlab.com/gitlab-org/quality/triage-ops/-/blob/master/doc/reactive/run_locally.md#reproduce-a-production-webhook-event-locally.

I then ran the processor directly on those events.

```shell
cd ~/src/triage-ops
export GITLAB_API_ENDPOINT=https://gitlab.com
bundle exec rack-console
```

Then, in the console:

```ruby
payload = JSON.parse(File.read("#{Dir.home}/Desktop/test1.json"))
event = Triage::Event.build(payload)
p = Triage::PushFailureCategoryForJobs.new(event)

p.send :applicable?
p.send :process
```

Below are some outputs I received:

Pipeline: https://gitlab.com/gitlab-org/gitlab/-/pipelines/1856676997/failures

```
[2] pry(main)> p.send :process
=> {10273577043=>"logs_too_big_to_analyze", 10273576919=>"logs_too_big_to_analyze", 10273576925=>"logs_too_big_to_analyze", 10273577044=>"logs_too_big_to_analyze"}
```

Pipeline: https://gitlab.com/gitlab-org/gitlab-foss/-/pipelines/1856349695/failures

```
[2] pry(main)> p.send :process
=> {10271862722=>"rspec_valid_rspec_errors_or_flaky_tests"}
```

Pipeline: https://gitlab.com/gitlab-org/gitlab/-/pipelines/1856611910/failures

```
[2] pry(main)> p.send :process
=> {10273188567=>"rspec_valid_rspec_errors_or_flaky_tests"}
```

Expected impact & dry-runs

These are strongly recommended to assist reviewers and reduce the time to merge your change.

See https://gitlab.com/gitlab-org/quality/triage-ops/-/tree/master/doc/scheduled#testing-policies-with-a-dry-run on how to perform dry-runs for new policies.

See https://gitlab.com/gitlab-org/quality/triage-ops/-/blob/master/doc/reactive/best_practices.md#use-the-sandbox-to-test-new-processors on how to make sure a new processor can be tested.

Action items

Edited by David Dieulivol
