Map CI/CD failures to failure categories
Context
Part of MR Cycle Time Track: Data Observability (&16185 - closed).
Problem Statement
We know that pipeline instability (e.g. flaky tests, infrastructure issues) is a major problem for GitLab engineers. However, we currently have little visibility into the most important sources of that instability.
Goal
- Define the most common failure categories in CI/CD pipelines for gitlab-org/gitlab.
- Map CI/CD pipeline/job failures to a failure category.
Progress
- ✅ DX - CI Failures is now using internal events under the hood 🎉
- ✅ We only map failure categories in RSpec jobs (20% of all the failures). We'll need to expand this list.
- ✅ We'll need a process to consolidate/refine the existing failure categories.
- 💭 Possible improvements
  - ✅ We should add a description field in the failure categories definitions
  - ✅ Group patterns by category
  - ✅ Have a local CLI one-liner with a job URL as param. We could just give a URL (like we do with the pipeline visualizer), and have the error that was guessed… would be super useful! (a rough sketch follows this list)
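For illustration, here is one way such a one-liner could look. This is only a sketch: the script name, the `GITLAB_API_TOKEN` variable, and the pattern list are assumptions, and a real version would reuse the failure category definitions described under Technical Solution Overview below. Only the job trace endpoint (`GET /projects/:id/jobs/:job_id/trace`) is a documented API.

```ruby
#!/usr/bin/env ruby
# Hypothetical one-liner sketch: ruby categorize_failure.rb <job_url>
require 'net/http'
require 'uri'
require 'cgi'

job_url = ARGV.fetch(0) # e.g. https://gitlab.com/gitlab-org/gitlab/-/jobs/1234567890
project_url, job_id = job_url.split('/-/jobs/')
project_path = CGI.escape(URI(project_url).path.delete_prefix('/'))

# Download the job trace via the documented job trace API.
trace_uri = URI("https://gitlab.com/api/v4/projects/#{project_path}/jobs/#{job_id}/trace")
request = Net::HTTP::Get.new(trace_uri)
request['PRIVATE-TOKEN'] = ENV.fetch('GITLAB_API_TOKEN') # assumed local token variable

trace = Net::HTTP.start(trace_uri.hostname, trace_uri.port, use_ssl: true) do |http|
  http.request(request).body
end

# Illustrative patterns only; the real list would come from the shared
# failure category definitions.
CATEGORIES = {
  'rspec_failure' => /Failures:/,
  'network_issue' => /Errno::ECONNRESET|Connection reset by peer/,
  'job_timeout'   => /ERROR: Job failed: execution took longer than/
}.freeze

category, _ = CATEGORIES.find { |_, pattern| trace.match?(pattern) }
puts(category || 'unknown_failure_category')
```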
Technical Solution Overview
We initially approached this issue with CI/CD custom exit codes. That approach had a few technical limitations (a sketch of the idea follows the list):
- We had to redirect the entire standard output to a file and parse that file. This proved to be rather complex (all the logic lives in shell scripts, which have no tests written for them).
- We were limited in the number of exit codes we could use (a maximum of 100, possibly fewer), and some were already overlapping with existing exit codes.
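For reference, here is a minimal Ruby rendering of the exit-code idea (the real logic lived in the untested shell scripts mentioned above; the log file name, patterns, and base exit code here are assumptions for illustration only):

```ruby
# Hypothetical sketch of the abandoned exit-code approach.
# Assumptions: the `script:` section redirected its output to job_output.log,
# and exit codes starting at 160 were reserved for failure categories.
FAILURE_CATEGORY_OFFSETS = {
  /Failures:/ => 1,                                   # RSpec failure
  /Errno::ECONNRESET|Connection reset by peer/ => 2,  # network/infrastructure issue
  /bundle install/ => 3                               # dependency installation issue
}.freeze

BASE_EXIT_CODE = 160 # fewer than 100 usable codes remain below 255, some already taken

log = File.read('job_output.log')

FAILURE_CATEGORY_OFFSETS.each do |pattern, offset|
  # Exit with a category-specific code the CI/CD pipeline can interpret.
  exit(BASE_EXIT_CODE + offset) if log.match?(pattern)
end

exit 1 # no known category matched
```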
With Map RSpec CI jobs to a failure category (!187501 - merged), we switched to another approach by leveraging internal events. In the `after_script` section of CI/CD jobs, we do the following (sketched below):
- Download the CI/CD trace for the current job
- Analyze it, and try to map the error to a known failure category
- Push an internal event with that information
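A minimal sketch of that flow, assuming a small Ruby script invoked from `after_script`. The `CI_*` variables are GitLab predefined variables and the trace endpoint is the documented job trace API; the token variable, the patterns, and the event payload are illustrative assumptions, and the actual internal event tracking from !187501 is stubbed out here:

```ruby
# Hypothetical after_script sketch: download the current job's trace,
# map it to a failure category, and report an internal event.
require 'net/http'
require 'uri'
require 'json'

api_url    = ENV.fetch('CI_API_V4_URL')   # e.g. https://gitlab.com/api/v4
project_id = ENV.fetch('CI_PROJECT_ID')
job_id     = ENV.fetch('CI_JOB_ID')

# 1. Download the CI/CD trace for the current job.
trace_uri = URI("#{api_url}/projects/#{project_id}/jobs/#{job_id}/trace")
request = Net::HTTP::Get.new(trace_uri)
request['PRIVATE-TOKEN'] = ENV.fetch('FAILURE_CATEGORY_API_TOKEN') # assumed CI variable
trace = Net::HTTP.start(trace_uri.hostname, trace_uri.port, use_ssl: true) do |http|
  http.request(request).body
end

# 2. Map the error to a known failure category (patterns are illustrative).
FAILURE_CATEGORIES = {
  'rspec_failure' => /Failures:/,
  'network_issue' => /Errno::ECONNRESET|Connection reset by peer/,
  'job_timeout'   => /ERROR: Job failed: execution took longer than/
}.freeze
category, _ = FAILURE_CATEGORIES.find { |_, regex| trace.match?(regex) }

# 3. Push an internal event with that information. The real MR uses the
# internal events tooling; this stub only shows the payload shape.
payload = { event: 'ci_job_failure_categorized', failure_category: category || 'unknown' }
puts "Would track internal event: #{JSON.generate(payload)}"
```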
There are pros and cons to this approach:
- (+) We don't have to manipulate the `script` section of a CI/CD job
- (+) Written exclusively in Ruby, with tests to verify the behavior
- (+) We are not limited by the number of failure categories anymore (we currently have around 180 of them)
- (-) The CI/CD job trace downloaded via the API might be incomplete at times, so we could miss the real failure if it happened later in the job
- (-) An error happening after the `after_script` would not be caught (e.g. artifacts upload)