Map CI/CD failures to failure categories
Context
Part of MR Cycle Time Track: Data Observability (&16185 - closed).
Problem Statement
We know that pipeline instability (e.g. flaky tests, infrastructure issues) is a major problem for GitLab engineers. However, we currently have little visibility into the most important sources of that instability.
Goal
- Define the most common failure categories in CI/CD pipelines for gitlab-org/gitlab.
- Map CI/CD pipeline/job failures to a failure category.
Progress
- ✅ DX - CI Failures is now using internal events under the hood 🎉
- ✅ We only map failure categories in RSpec jobs (20% of all the failures). We'll need to expand this list.
- ✅ We'll need a process to consolidate/refine the existing failure categories.
- 💭 Possible improvements
  - ✅ We should add a description field in the failure categories definitions
  - ✅ Group patterns by category
  - ✅ Have a local CLI one-liner with a job URL as param. We could just give a URL (like we do with the pipeline visualizer), and have the error that was guessed… would be super useful! (a rough sketch follows this list)
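For illustration, here is one way such a one-liner could look. This is only a sketch: the script name, the `GITLAB_API_TOKEN` variable, and the pattern list are assumptions, and a real version would reuse the failure category definitions described under Technical Solution Overview below. Only the job trace endpoint (`GET /projects/:id/jobs/:job_id/trace`) is a documented API.

```ruby
#!/usr/bin/env ruby
# Hypothetical one-liner sketch: ruby categorize_failure.rb <job_url>
require 'net/http'
require 'uri'
require 'cgi'

job_url = ARGV.fetch(0) # e.g. https://gitlab.com/gitlab-org/gitlab/-/jobs/1234567890
project_url, job_id = job_url.split('/-/jobs/')
project_path = CGI.escape(URI(project_url).path.delete_prefix('/'))

# Download the job trace via the documented job trace API.
trace_uri = URI("https://gitlab.com/api/v4/projects/#{project_path}/jobs/#{job_id}/trace")
request = Net::HTTP::Get.new(trace_uri)
request['PRIVATE-TOKEN'] = ENV.fetch('GITLAB_API_TOKEN') # assumed local token variable

trace = Net::HTTP.start(trace_uri.hostname, trace_uri.port, use_ssl: true) do |http|
  http.request(request).body
end

# Illustrative patterns only; the real list would come from the shared
# failure category definitions.
CATEGORIES = {
  'rspec_failure' => /Failures:/,
  'network_issue' => /Errno::ECONNRESET|Connection reset by peer/,
  'job_timeout'   => /ERROR: Job failed: execution took longer than/
}.freeze

category, _ = CATEGORIES.find { |_, pattern| trace.match?(pattern) }
puts(category || 'unknown_failure_category')
```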
Technical Solution Overview
We initially approached this issue with CI/CD custom exit codes. That approach had a few technical limitations (a sketch of the idea follows the list):
- We had to redirect the entire standard output to a file and parse that file. This proved to be rather complex (all the logic lives in shell scripts, which have no tests written for them).
- We were limited in the number of exit codes we could use (a maximum of 100, possibly fewer), and some were already overlapping with existing exit codes.
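For reference, here is a minimal Ruby rendering of the exit-code idea (the real logic lived in the untested shell scripts mentioned above; the log file name, patterns, and base exit code here are assumptions for illustration only):

```ruby
# Hypothetical sketch of the abandoned exit-code approach.
# Assumptions: the `script:` section redirected its output to job_output.log,
# and exit codes starting at 160 were reserved for failure categories.
FAILURE_CATEGORY_OFFSETS = {
  /Failures:/ => 1,                                   # RSpec failure
  /Errno::ECONNRESET|Connection reset by peer/ => 2,  # network/infrastructure issue
  /bundle install/ => 3                               # dependency installation issue
}.freeze

BASE_EXIT_CODE = 160 # fewer than 100 usable codes remain below 255, some already taken

log = File.read('job_output.log')

FAILURE_CATEGORY_OFFSETS.each do |pattern, offset|
  # Exit with a category-specific code the CI/CD pipeline can interpret.
  exit(BASE_EXIT_CODE + offset) if log.match?(pattern)
end

exit 1 # no known category matched
```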
With Map RSpec CI jobs to a failure category (!187501 - merged), we switched to another approach by leveraging internal events. In the `after_script` section of CI/CD jobs, we do the following (sketched below):
- Download the CI/CD trace for the current job
- Analyze it, and try to map the error to a known failure category
- Push an internal event with that information
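A minimal sketch of that flow, assuming a small Ruby script invoked from `after_script`. The `CI_*` variables are GitLab predefined variables and the trace endpoint is the documented job trace API; the token variable, the patterns, and the event payload are illustrative assumptions, and the actual internal event tracking from !187501 is stubbed out here:

```ruby
# Hypothetical after_script sketch: download the current job's trace,
# map it to a failure category, and report an internal event.
require 'net/http'
require 'uri'
require 'json'

api_url    = ENV.fetch('CI_API_V4_URL')   # e.g. https://gitlab.com/api/v4
project_id = ENV.fetch('CI_PROJECT_ID')
job_id     = ENV.fetch('CI_JOB_ID')

# 1. Download the CI/CD trace for the current job.
trace_uri = URI("#{api_url}/projects/#{project_id}/jobs/#{job_id}/trace")
request = Net::HTTP::Get.new(trace_uri)
request['PRIVATE-TOKEN'] = ENV.fetch('FAILURE_CATEGORY_API_TOKEN') # assumed CI variable
trace = Net::HTTP.start(trace_uri.hostname, trace_uri.port, use_ssl: true) do |http|
  http.request(request).body
end

# 2. Map the error to a known failure category (patterns are illustrative).
FAILURE_CATEGORIES = {
  'rspec_failure' => /Failures:/,
  'network_issue' => /Errno::ECONNRESET|Connection reset by peer/,
  'job_timeout'   => /ERROR: Job failed: execution took longer than/
}.freeze
category, _ = FAILURE_CATEGORIES.find { |_, regex| trace.match?(regex) }

# 3. Push an internal event with that information. The real MR uses the
# internal events tooling; this stub only shows the payload shape.
payload = { event: 'ci_job_failure_categorized', failure_category: category || 'unknown' }
puts "Would track internal event: #{JSON.generate(payload)}"
```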
There are pros and cons to this approach:
- (+) We don't have to manipulate the `script` section of a CI/CD job
- (+) Written exclusively in Ruby, with tests to verify the behavior
- (+) We are not limited by the number of failure categories anymore (we currently have around 180 of them)
- (-) The CI/CD job trace downloaded via the API might be incomplete at times, so we could miss the real failure if it happened later in the job
- (-) An error happening after the `after_script` would not be caught (e.g. artifacts upload)