Add failure signatures to CI failure analysis

Context

Extends the existing failure categories system to create unique signatures for better failure clustering and analysis.

What's in this MR?

Restructured the code:

  • Moved files from lib/failure_categories/ to lib/ci_failure_analysis/
  • Cleaned up YAML files by removing causes/solutions sections
  • Added new context extraction options to control how much text we capture from the CI failure

New failure signature system:

  • First extracts relevant context from CI job traces based on failure category rules
  • Then normalizes the text by removing timestamps, IDs, and file paths
  • Finally hashes the normalized text into a 16-character signature, so errors that normalize to the same text get the same signature
  • Helps identify when multiple jobs have the same underlying problem
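The normalize-then-hash steps above can be sketched as follows. This is a minimal illustration, not the code from this MR: the rule list, the `normalize` and `signature` helper names, the `<TIMESTAMP>` placeholder, and the choice of SHA-256 truncated to 16 hex characters are all assumptions. Only the `<ID>` and `<path>` placeholders and the `category__hash` shape come from the output shown later in this description, and these simplified rules will not reproduce that output exactly.

```ruby
require 'digest'

# Hypothetical normalization rules, applied in order. The real rules in
# lib/ci_failure_analysis/ may differ in both patterns and placeholders.
NORMALIZATION_RULES = [
  # ISO 8601-style timestamps (assumed format)
  [/\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?/, '<TIMESTAMP>'],
  # Slash-separated paths with at least two segments
  [%r{(?:/[\w.-]+){2,}}, '/<path>'],
  # Any remaining run of digits (job IDs, version numbers, ports)
  [/\d+/, '<ID>']
].freeze

# Replace volatile parts of the context with stable placeholders.
def normalize(text)
  NORMALIZATION_RULES.reduce(text) do |acc, (pattern, placeholder)|
    acc.gsub(pattern, placeholder)
  end
end

# Build "<category>__<16 hex chars>" from the normalized context.
# SHA-256 is an assumption; any stable hash would serve the same purpose.
def signature(category, context)
  digest = Digest::SHA256.hexdigest(normalize(context))[0, 16]
  "#{category}__#{digest}"
end
```

With these rules, two contexts that differ only in a version number, such as `"alpine:3.18"` and `"alpine:3.19"`, normalize to the same text and therefore share a signature, while a context naming a different image gets a different one.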

New tools:

  • bin/analyze_signatures - Shows clustering quality and finds normalization gaps
  • Updated failure analyzer now outputs detailed CSV files with all the metadata

Better workflow:

  1. Run failure analysis to get signatures
  2. Use signature analyzer to see clustering quality
  3. Improve normalization rules based on findings

This gives us both broad categories (like "danger" or "rspec") and precise signatures within those categories for better incident detection and analysis.

What it looks like

I'll take the failed_to_pull_image failure category as an example:

  failed_to_pull_image:
    description: "Docker image pull failures in CI/CD, where container images cannot be downloaded from the registry."
    patterns:
      - 'ERROR: Job failed: failed to pull image "[^"]+"'
    context_scope: "match"
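To make the role of context_scope concrete, here is a rough sketch of how a category definition like the one above could drive context extraction. It assumes that "match" means "keep only the text matched by the pattern"; the `extract_context` helper, the `CONFIG` constant, the shortened description, and the fallback to the whole matching line for other scopes are hypothetical and not taken from this MR.

```ruby
require 'yaml'

# The category definition from above, with a shortened description.
CATEGORY_YAML = <<~'YAML'
  failed_to_pull_image:
    description: "Docker image pull failures in CI/CD."
    patterns:
      - 'ERROR: Job failed: failed to pull image "[^"]+"'
    context_scope: "match"
YAML

CONFIG = YAML.safe_load(CATEGORY_YAML).fetch('failed_to_pull_image')

# Return the context for the first pattern that matches the job trace,
# or nil when no pattern matches.
def extract_context(trace, config)
  config.fetch('patterns').each do |source|
    match = Regexp.new(source).match(trace)
    next unless match

    # Assumption: "match" keeps only the matched text. Any other scope
    # falls back to the full line containing the match (hypothetical).
    return match[0] if config['context_scope'] == 'match'

    return trace.each_line.find { |line| line.include?(match[0]) }&.chomp
  end
  nil
end
```

Calling `extract_context(trace, CONFIG)` on a trace containing a pull failure returns just the `ERROR: Job failed: failed to pull image "..."` portion, which keeps trailing, more variable text (such as timeout details) out of the signature.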

Output of the similarity analysis:

$ bin/analyze_signatures results_backup.csv --category failed_to_pull_image
================================================================================
CI FAILURE SIGNATURE ANALYSIS
================================================================================
Data source: results_backup.csv
Total records: 1198
Category filter: failed_to_pull_image

DETAILED ANALYSIS FOR: failed_to_pull_image
--------------------------------------------------
Total failures: 1198
Unique signatures: 31
Diversity ratio: 2.6%

TOP SIGNATURES:
  failed_to_pull_image__8de93063ed71890c (297 occurrences)
    Context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<I...
    Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11340496456

  failed_to_pull_image__aa420c7dc6ca1298 (260 occurrences)
    Context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/gitlab-build-images:postgres-<ID>...
    Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11346143284

  failed_to_pull_image__b7ece9751b5ce676 (230 occurrences)
    Context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/gitlab-build-images:redis-cluster...
    Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11332760679

  failed_to_pull_image__cf30adb515f8279b (73 occurrences)
    Context: ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/alpine:<ID>.<ID>"...
    Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11342648590

  failed_to_pull_image__c93432913964bb9a (48 occurrences)
    Context: ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/alpine:latest"...
    Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11323566051

  failed_to_pull_image__7517eacc8de6d9e3 (44 occurrences)
    Context: ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-alpine3.<ID>"...
    Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11337140018

  failed_to_pull_image__4208897883ad0993 (43 occurrences)
    Context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<I...
    Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11334230077

  failed_to_pull_image__67cdc1d1f344127f (35 occurrences)
    Context: ERROR: Job failed: failed to pull image "mirror.gcr.io/docker:<ID>.<ID>.<ID>-dind"...
    Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11261841751

  failed_to_pull_image__18552d10be2769f9 (25 occurrences)
    Context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<I...
    Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11316044957

  failed_to_pull_image__e441ca6006434596 (20 occurrences)
    Context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<I...
    Example: https://gitlab.com/gitlab-org/gitlab/-/jobs/11306948645


SIMILARITY ANALYSIS:
Found 465 similar signature pairs:
1. Similarity: 98.9%
   Signature A: failed_to_pull_image__20a7c14ed148708f
   Example A: https://gitlab.com/gitlab-org/gitlab/-/jobs/10955078379
   Signature B: failed_to_pull_image__54851b4c24d7ded7
   Example B: https://gitlab.com/gitlab-org/gitlab/-/jobs/10374525009
   Contexts with diff:
     - ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>"
     + ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-slim"

2. Similarity: 98.8%
   Signature A: failed_to_pull_image__de94862822de962e
   Example A: https://gitlab.com/gitlab-org/gitlab/-/jobs/11343012470
   Signature B: failed_to_pull_image__7517eacc8de6d9e3
   Example B: https://gitlab.com/gitlab-org/gitlab/-/jobs/11337140018
   Contexts with diff:
     - ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-alpine"
     + ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-alpine3.<ID>"

3. Similarity: 98.7%
   Signature A: failed_to_pull_image__de94862822de962e
   Example A: https://gitlab.com/gitlab-org/gitlab/-/jobs/11343012470
   Signature B: failed_to_pull_image__54851b4c24d7ded7
   Example B: https://gitlab.com/gitlab-org/gitlab/-/jobs/10374525009
   Contexts with diff:
     - ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-alpine"
     + ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-slim"

4. Similarity: 98.5%
   Signature A: failed_to_pull_image__de94862822de962e
   Example A: https://gitlab.com/gitlab-org/gitlab/-/jobs/11343012470
   Signature B: failed_to_pull_image__20a7c14ed148708f
   Example B: https://gitlab.com/gitlab-org/gitlab/-/jobs/10955078379
   Contexts with diff:
     - ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-alpine"
     + ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>"

5. Similarity: 98.2%
   Signature A: failed_to_pull_image__4208897883ad0993
   Example A: https://gitlab.com/gitlab-org/gitlab/-/jobs/11334230077
   Signature B: failed_to_pull_image__365b1f2d896e3b24
   Example B: https://gitlab.com/gitlab-org/gitlab/-/jobs/10790553672
   Contexts with diff:
     - ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<ID>.<ID>-node-<ID>.<ID>:rubygems-<ID>.<ID>-git-<ID>.<ID>-lfs-<ID>.<ID>-yarn-<ID>.<ID>-graphicsmagick-<ID>.<ID>.<ID>"
     + ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<ID>.<ID>-node-<ID>.<ID>:rubygems-<ID>.<ID>-git-<ID>.<ID>-lfs-<ID>.<ID>-yarn-<ID>.<ID>-graphicsmagick-<ID>.<ID>.<ID>-docker-<ID>.<ID>.<ID>"

... and 460 more similar pairs

It shows a few interesting results:

  1. Diversity ratio
Total failures: 1198
Unique signatures: 31
Diversity ratio: 2.6%

This is an excellent grouping: we have 31 signatures for ~1200 CI jobs.
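The reported figure is consistent with dividing unique signatures by total failures (31 / 1198 ≈ 2.6%), which works out to roughly 39 jobs per signature on average. A minimal sketch of that calculation, assuming this is how the ratio is defined; the `diversity_ratio` helper is hypothetical:

```ruby
# Percentage of unique signatures among all failures, rounded to one
# decimal place. Lower values mean failures cluster into fewer groups.
def diversity_ratio(signatures)
  return 0.0 if signatures.empty?

  (signatures.uniq.size.to_f / signatures.size * 100).round(1)
end
```

A ratio near 100% would mean almost every job has its own signature, which usually points at context that is not normalized enough.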

  2. For example, we found 297 jobs with the failure signature failed_to_pull_image__8de93063ed71890c:
  • Job: https://gitlab.com/gitlab-org/gitlab/-/jobs/10429441036
  • Failure category: failed_to_pull_image
  • Failure signature: failed_to_pull_image__8de93063ed71890c,8de93063ed71890c
  • Normalized context: ERROR: Job failed: failed to pull image "registry.gitlab.com/<path>/debian-bookworm-slim-ruby-<ID>.<ID>.<ID>-golang-<ID>.<ID>-node-<ID>.<ID>-postgresql-<ID>:rubygems-<ID>.<ID>-git-<ID>.<ID>-lfs-<ID>.<ID>-chrome-<ID>-yarn-<ID>.<ID>-graphicsmagick-<ID>.<ID>.<ID>"
  • Raw context: ERROR: Job failed: failed to pull image "registry.gitlab.com/gitlab-org/gitlab-build-images/ci/debian-bookworm-slim-ruby-3.3.8-golang-1.23-node-20.12-postgresql-14:rubygems-3.6-git-2.49-lfs-2.9-chrome-123-yarn-1.22-graphicsmagick-1.3.36"
  3. Some signatures have a very similar context:
SIMILARITY ANALYSIS:
Found 465 similar signature pairs:
1. Similarity: 98.9%
   Signature A: failed_to_pull_image__20a7c14ed148708f
   Example A: https://gitlab.com/gitlab-org/gitlab/-/jobs/10955078379
   Signature B: failed_to_pull_image__54851b4c24d7ded7
   Example B: https://gitlab.com/gitlab-org/gitlab/-/jobs/10374525009
   Contexts with diff:
     - ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>"
     + ERROR: Job failed: failed to pull image "gitlab.com:<ID>/<path>/ruby:<ID>.<ID>.<ID>-slim"

This might indicate that we need to improve the context normalization or change how much context we extract (more or less), but not always. Sometimes failures are extremely similar and differ only by one important attribute, which should rightfully produce a different signature.
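For readers who want to experiment with this kind of pairwise comparison, here is a rough sketch. It uses a Levenshtein-based ratio, which is an assumption: this description does not say which similarity measure bin/analyze_signatures uses, and this measure will not reproduce the percentages above. The `levenshtein`, `similarity`, and `similar_pairs` helpers and the 0.9 threshold are hypothetical. Note also that 465 is exactly the number of unordered pairs among 31 signatures (31 × 30 / 2), which suggests every pair in this category was reported as similar, plausibly because all contexts share the same long prefix.

```ruby
# Classic dynamic-programming edit distance, keeping one row at a time.
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# 1.0 for identical strings, approaching 0.0 as they diverge.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein(a, b).to_f / longest
end

# Compare every pair of signatures by their normalized context and keep
# the pairs at or above the threshold, most similar first.
def similar_pairs(contexts_by_signature, threshold: 0.9)
  contexts_by_signature.to_a.combination(2).filter_map do |(sig_a, ctx_a), (sig_b, ctx_b)|
    score = similarity(ctx_a, ctx_b)
    [sig_a, sig_b, score] if score >= threshold
  end.sort_by { |_, _, score| -score }
end
```

A pair that differs only by a suffix such as `-slim` scores high under this measure, which is exactly the kind of case that needs a human decision: merge the signatures through better normalization, or keep them apart because the suffix matters.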

Edited by David Dieulivol
