Map CI/CD failures to failure categories

Context

Part of MR Cycle Time Track: Data Observability (&16185 - closed).

Problem Statement

We know that pipeline instability (e.g. flaky tests, infrastructure issues) is a major problem for GitLab engineers. However, we currently have little visibility into the most significant sources of instability.

Goal

  • Define the most common failure categories in CI/CD pipelines for gitlab-org/gitlab.
  • Map CI/CD pipeline/job failures to a failure category.

Progress

Technical Solution Overview

We initially approached this issue using CI/CD custom exit codes. That approach had a few technical limitations:

  1. We had to redirect the job's entire standard output to a file and parse that file. This proved rather complex, since all of that logic lives in shell scripts, which have no tests (see the sketch after this list).
  2. We were limited in the number of exit codes we could use (a maximum of 100, possibly fewer), and some values overlapped with exit codes already in use.
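
For illustration, here is a minimal Ruby sketch of the exit-code approach; the categories, error signatures, and exit codes below are hypothetical, not the ones we actually used:

```ruby
#!/usr/bin/env ruby
# Sketch of the abandoned exit-code approach: the job's standard output is
# redirected to a file, which we scan for known error signatures and
# translate into a custom exit code.

# Hypothetical failure category => custom exit code mapping. With roughly
# 100 usable codes at most, categories quickly exhaust the space, and some
# values collide with exit codes that already carry a meaning.
EXIT_CODES = {
  'rspec_flaky_test'     => 160,
  'infrastructure_issue' => 161,
  'gem_installation'     => 162
}.freeze

# Hypothetical error signatures.
SIGNATURES = {
  /failed .* but passed on retry/i => 'rspec_flaky_test',
  /Connection reset by peer/       => 'infrastructure_issue',
  /Bundler::InstallError/          => 'gem_installation'
}.freeze

log = File.read(ARGV.fetch(0)) # the file standard output was redirected to

category = SIGNATURES.find { |regex, _| log.match?(regex) }&.last

# Exit with the mapped code, or a generic failure code when nothing matched.
exit(category ? EXIT_CODES.fetch(category) : 1)
```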

With Map RSpec CI jobs to a failure category (!187501 - merged), we switched to another approach that leverages internal events. In the after_script section of CI/CD jobs, we:

  1. Download the CI/CD trace for the current job
  2. Analyze it and try to map the error to a known failure category
  3. Push an internal event with that information (see the sketch below)
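
A minimal Ruby sketch of those three steps, assuming a FAILURE_ANALYSIS_TOKEN CI/CD variable for API authentication and a track_internal_event helper standing in for the actual internal events call; the category patterns are also hypothetical, and the real implementation lives in the merged MR above:

```ruby
#!/usr/bin/env ruby
# Sketch of the after_script flow. CI_API_V4_URL, CI_PROJECT_ID and CI_JOB_ID
# are predefined CI/CD variables; FAILURE_ANALYSIS_TOKEN is an assumed access
# token, and track_internal_event stands in for the real internal events call.
require 'net/http'
require 'uri'

# Hypothetical failure categories; the real list has around 180 entries.
FAILURE_CATEGORIES = {
  rspec_flaky_test: /failed .* but passed on retry/i,
  infrastructure:   /Connection reset by peer|502 Bad Gateway/,
  gem_installation: /Bundler::InstallError/
}.freeze

# Stand-in for the actual internal events tracking (assumption).
def track_internal_event(name, **properties)
  puts "event=#{name} #{properties.map { |k, v| "#{k}=#{v}" }.join(' ')}"
end

# 1. Download the CI/CD trace for the current job.
uri = URI("#{ENV['CI_API_V4_URL']}/projects/#{ENV['CI_PROJECT_ID']}" \
          "/jobs/#{ENV['CI_JOB_ID']}/trace")
request = Net::HTTP::Get.new(uri)
request['PRIVATE-TOKEN'] = ENV['FAILURE_ANALYSIS_TOKEN']
trace = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.request(request).body
end

# 2. Analyze the trace and try to map the error to a known failure category.
category = FAILURE_CATEGORIES.find { |_, regex| trace.to_s.match?(regex) }&.first || :unknown

# 3. Push an internal event with that information.
track_internal_event('ci_job_failure_categorized', category: category, job_id: ENV['CI_JOB_ID'])
```

Because this runs entirely in after_script, the script section of the job stays untouched, which leads to the trade-offs below.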

There are pros and cons to this approach:

  • (+) We don't have to manipulate the script section of a CI/CD job
  • (+) Written exclusively in Ruby, with tests to verify the behavior
  • (+) We are no longer limited in the number of failure categories (we currently have around 180 of them)
  • (-) The CI/CD job trace downloaded via the API can be incomplete at times, so we could miss the real failure if it happened later in the job
  • (-) An error happening after the after_script section (e.g. artifacts upload) would not be caught