Add failure signatures to CI failure analysis
Context
Extends the existing failure categories system to create unique signatures for better failure clustering and analysis.
- Closes Add explicit failure categories for automation ... (gitlab-org/quality/analytics/team#295 - closed)
- Closes Clean up and remove unused fields from existing... (gitlab-org/quality/analytics/team#290 - closed)
- Closes Create documentation for CI failure signatures (gitlab-org/quality/analytics/team#289 - closed)
- Closes Display failure signatures in FCA (Failure Cate... (gitlab-org/quality/analytics/team#288 - closed)
- Closes Establish process to validate and improve failu... (gitlab-org/quality/analytics/team#287 - closed)
- Closes Implement failure signatures for generic failur... (gitlab-org/quality/analytics/team#274 - closed)
What's in this MR?
Restructured the code:
- Moved files from lib/failure_categories/ to lib/ci_failure_analysis/
- Cleaned up YAML files by removing causes/solutions sections
- Added new context extraction options to control how much text we capture from the CI failure
New failure signature system:
- First extracts relevant context from CI job traces based on failure category rules
- Then normalizes the text by removing timestamps, IDs, and file paths
- Finally generates unique 16-character signatures so similar errors get the same hash
- Helps identify when multiple jobs have the same underlying problem
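To make the normalize-then-hash steps concrete, here is a minimal sketch in Python. It is not the code from this MR: the regex rules, the `<TIMESTAMP>` placeholder, the function names, and the choice of SHA-256 truncated to 16 hex characters are all assumptions made for illustration. Only the `<ID>` and `<path>` placeholders and the `category__hash` shape come from the output shown in this description.

```python
import hashlib
import re

# Hypothetical normalization rules, applied in order. The real rules in this
# MR are not shown in the description, so these are illustrative only.
NORMALIZATION_RULES = [
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<TIMESTAMP>"),
    (re.compile(r"(?:/[\w.-]+){2,}"), "<path>"),
    (re.compile(r"\d+"), "<ID>"),
]


def normalize(context: str) -> str:
    """Replace volatile parts of the context with stable placeholders."""
    for pattern, placeholder in NORMALIZATION_RULES:
        context = pattern.sub(placeholder, context)
    return context


def signature(category: str, context: str) -> str:
    """Build a `<category>__<16 hex chars>` signature from the normalized context.

    Assumption: SHA-256 truncated to 16 characters; the MR does not state
    which hash function is used.
    """
    digest = hashlib.sha256(normalize(context).encode("utf-8")).hexdigest()
    return f"{category}__{digest[:16]}"
```

With rules like these, two contexts that differ only in a version number normalize to the same text and therefore share a signature, while a different image name produces a different one.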
New tools:
- bin/analyze_signatures - Shows clustering quality and finds normalization gaps
- Updated failure analyzer now outputs detailed CSV files with all the metadata
Better workflow:
- Run failure analysis to get signatures
- Use signature analyzer to see clustering quality
- Improve normalization rules based on findings
This gives us both broad categories (like "danger" or "rspec") and precise signatures within those categories for better incident detection and analysis.
What it looks like
I'll take the failed_to_pull_image failure category as an example:
failed_to_pull_image:
  description: "Docker image pull failures in CI/CD, where container images cannot be downloaded from the registry."
  patterns:
    - 'ERROR: Job failed: failed to pull image "[^"]+"'
  context_scope: "match"
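As a rough illustration of what context_scope: "match" means for this category, the sketch below keeps only the text matched by the pattern. The dictionary layout and the `extract_context` helper are hypothetical, not the MR's code. Other context_scope values exist in the MR but are not listed here, so the sketch deliberately handles only "match".

```python
import re

# Mirrors the failed_to_pull_image entry above; the dict layout is an assumption.
CATEGORY = {
    "name": "failed_to_pull_image",
    "patterns": [r'ERROR: Job failed: failed to pull image "[^"]+"'],
    "context_scope": "match",
}


def extract_context(trace, category):
    """Return the text matched by the first matching pattern, or None."""
    if category["context_scope"] != "match":
        # Other scopes are not described in this MR description.
        raise ValueError(f"unsupported context_scope: {category['context_scope']}")
    for pattern in category["patterns"]:
        match = re.search(pattern, trace)
        if match is not None:
            return match.group(0)
    return None
```

Restricting the context to the match keeps surrounding log noise out of the signature, so the hash depends only on the failing image reference.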
Output of the similarity analysis:
$ bin/analyze_signatures results_backup.csv --category failed_to_pull_image
================================================================================
CI FAILURE SIGNATURE ANALYSIS
================================================================================
Data source: results_backup.csv
Total records: 1198
Category filter: failed_to_pull_image
DETAILED ANALYSIS FOR: failed_to_pull_image
--------------------------------------------------
Total failures: 1198
Unique signatures: 31
Diversity ratio: 2.6%
TOP SIGNATURES:
failed_to_pull_image__8de93063ed71890c (297 occurrences)
Context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<I...
Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11340496456
failed_to_pull_image__aa420c7dc6ca1298 (260 occurrences)
Context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/gitlab-build-images:postgres-<ID>...
Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11346143284
failed_to_pull_image__b7ece9751b5ce676 (230 occurrences)
Context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/gitlab-build-images:redis-cluster...
Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11332760679
failed_to_pull_image__cf30adb515f8279b (73 occurrences)
Context: ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/alpine:<ID>.<ID>"...
Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11342648590
failed_to_pull_image__c93432913964bb9a (48 occurrences)
Context: ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/alpine:latest"...
Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11323566051
failed_to_pull_image__7517eacc8de6d9e3 (44 occurrences)
Context: ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-alpine3.<ID>"...
Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11337140018
failed_to_pull_image__4208897883ad0993 (43 occurrences)
Context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<I...
Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11334230077
failed_to_pull_image__67cdc1d1f344127f (35 occurrences)
Context: ERROR: Job failed: failed to pull image "mirror.gcr.io/docker:<ID>.<ID>.<ID>-dind"...
Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11261841751
failed_to_pull_image__18552d10be2769f9 (25 occurrences)
Context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<I...
Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11316044957
failed_to_pull_image__e441ca6006434596 (20 occurrences)
Context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<I...
Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11306948645
SIMILARITY ANALYSIS:
Found 465 similar signature pairs:
1. Similarity: 98.9%
Signature A: failed_to_pull_image__20a7c14ed148708f
Example A: https://gitlab.com/gitlab-org/gitlab/-/jobs/10955078379
Signature B: failed_to_pull_image__54851b4c24d7ded7
Example B: https://gitlab.com/gitlab-org/gitlab/-/jobs/10374525009
Contexts with diff:
- ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>"
+ ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-slim"
2. Similarity: 98.8%
Signature A: failed_to_pull_image__de94862822de962e
Example A: https://gitlab.com/gitlab-org/gitlab/-/jobs/11343012470
Signature B: failed_to_pull_image__7517eacc8de6d9e3
Example B: https://gitlab.com/gitlab-org/gitlab/-/jobs/11337140018
Contexts with diff:
- ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-alpine"
+ ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-alpine3.<ID>"
3. Similarity: 98.7%
Signature A: failed_to_pull_image__de94862822de962e
Example A: https://gitlab.com/gitlab-org/gitlab/-/jobs/11343012470
Signature B: failed_to_pull_image__54851b4c24d7ded7
Example B: https://gitlab.com/gitlab-org/gitlab/-/jobs/10374525009
Contexts with diff:
- ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-alpine"
+ ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-slim"
4. Similarity: 98.5%
Signature A: failed_to_pull_image__de94862822de962e
Example A: https://gitlab.com/gitlab-org/gitlab/-/jobs/11343012470
Signature B: failed_to_pull_image__20a7c14ed148708f
Example B: https://gitlab.com/gitlab-org/gitlab/-/jobs/10955078379
Contexts with diff:
- ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-alpine"
+ ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>"
5. Similarity: 98.2%
Signature A: failed_to_pull_image__4208897883ad0993
Example A: https://gitlab.com/gitlab-org/gitlab/-/jobs/11334230077
Signature B: failed_to_pull_image__365b1f2d896e3b24
Example B: https://gitlab.com/gitlab-org/gitlab/-/jobs/10790553672
Contexts with diff:
- ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<ID>.<ID>-node-<ID>.<ID>:rubygems-<ID>.<ID>-git-<ID>.<ID>-lfs-<ID>.<ID>-yarn-<ID>.<ID>-graphicsmagick-<ID>.<ID>.<ID>"
+ ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<ID>.<ID>-node-<ID>.<ID>:rubygems-<ID>.<ID>-git-<ID>.<ID>-lfs-<ID>.<ID>-yarn-<ID>.<ID>-graphicsmagick-<ID>.<ID>.<ID>-docker-<ID>.<ID>.<ID>"
... and 460 more similar pairs
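The "Unique signatures", "Diversity ratio" and "TOP SIGNATURES" figures in the output above boil down to simple counting. A minimal sketch, assuming the signatures have already been read from the CSV into a list (the function names are hypothetical, and the CSV column layout is not shown in this description):

```python
from collections import Counter


def diversity_ratio(signatures):
    """Percentage of unique signatures among all failures (lower = tighter clustering)."""
    if not signatures:
        return 0.0
    return len(set(signatures)) / len(signatures) * 100


def top_signatures(signatures, n=10):
    """Most frequent signatures with their occurrence counts."""
    return Counter(signatures).most_common(n)


# 31 unique signatures over 1198 failures gives roughly 2.6%.
print(round(31 / 1198 * 100, 1))
```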
It shows a few interesting results:
- Diversity ratio
Total failures: 1198
Unique signatures: 31
Diversity ratio: 2.6%
This is an excellent grouping: we have 31 signatures for ~1200 CI jobs.
- For example, we found 297 jobs with this failure signature:
failed_to_pull_image__8de93063ed71890c:
- Job: https://gitlab.com/gitlab-org/gitlab/-/jobs/10429441036
- Failure category: failed_to_pull_image
- Failure signature: failed_to_pull_image__8de93063ed71890c,8de93063ed71890c
- Normalized context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<ID>.<ID>-golang-<ID>.<ID>-node-<ID>.<ID>-postgresql-<ID>:rubygems-<ID>.<ID>-git-<ID>.<ID>-lfs-<ID>.<ID>-chrome-<ID>-yarn-<ID>.<ID>-graphicsmagick-<ID>.<ID>.<ID>"
- Raw context: ERROR: Job failed: failed to pull image "registry.gitlab.com/gitlab-org/gitlab-build-images/ci/debian-bookworm-slim-ruby-3.3.8-golang-1.23-node-20.12-postgresql-14:rubygems-3.6-git-2.49-lfs-2.9-chrome-123-yarn-1.22-graphicsmagick-1.3.36"
- Some signatures have a very similar context:
SIMILARITY ANALYSIS:
Found 465 similar signature pairs:
1. Similarity: 98.9%
Signature A: failed_to_pull_image__20a7c14ed148708f
Example A: https://gitlab.com/gitlab-org/gitlab/-/jobs/10955078379
Signature B: failed_to_pull_image__54851b4c24d7ded7
Example B: https://gitlab.com/gitlab-org/gitlab/-/jobs/10374525009
Contexts with diff:
- ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>"
+ ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-slim"
This might indicate that we need to improve the context normalization or change how much context we capture (more or less), but not always. Sometimes failures are extremely similar yet differ by one very important attribute, which should rightfully produce a different signature.
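For readers who want to experiment with this kind of pairwise comparison, here is a small sketch using Python's standard difflib. The similarity metric used by bin/analyze_signatures is not described here, so `SequenceMatcher.ratio()` and the 0.9 threshold are stand-ins and will not reproduce the percentages shown above. The `similar_pairs` helper is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(contexts, threshold=0.9):
    """Return (ratio, signature_a, signature_b) for near-identical contexts.

    `contexts` maps a signature to its normalized context. The threshold is
    an arbitrary illustrative value, not the one used by the real tool.
    """
    pairs = []
    for (sig_a, ctx_a), (sig_b, ctx_b) in combinations(contexts.items(), 2):
        ratio = SequenceMatcher(None, ctx_a, ctx_b).ratio()
        if ratio >= threshold:
            pairs.append((ratio, sig_a, sig_b))
    # Highest similarity first, as in the report above.
    return sorted(pairs, reverse=True)
```

Pairs that surface this way are candidates for review rather than automatic merging: a difference such as the -slim suffix may be a normalization gap or a meaningful distinction.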