Eliminate Git-Related Infrastructure Failures in CI Pipelines
Context
Part of Improve observability and reduce most frequent ... (&8 - closed). Many CI/CD jobs are failing before our automation tools can properly detect, analyze, and categorize them. This issue specifically focuses on analyzing and eliminating the Infrastructure failure category. Based on data from Snowflake (https://app.snowflake.com/ys68254/gitlab/w1oaUFxQaSYz), we've identified several patterns of git-related failures that are causing pipeline instability.
Business Impact
Impact to Internal Engineering:
- Git-related infrastructure failures directly reduce engineering throughput. These failures often occur before our automation can triage them, requiring manual debugging and wasting valuable developer time. This increases cycle time, delays delivery, and reduces focus on high-impact work.
- Developers have raised similar issues, including in comments 1, 2:
  "A lot of builds fail because git fails (transient but when you have 100 jobs a higher chance to happen to one of them in your pipeline)"
- Various master-broken incidents
- gitlab-org/quality/engineering-productivity/master-broken-incidents#12974 (closed) (Slack)
- gitlab-org/quality/engineering-productivity/master-broken-incidents#13103 (closed) (Slack)
- gitlab-org/quality/engineering-productivity/master-broken-incidents#13146 (closed) (Slack)
- gitlab-org/quality/engineering-productivity/master-broken-incidents#12997 (closed) (Slack), etc.
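The developer comment quoted above, that a transient git failure is more likely to hit a pipeline with 100 jobs, can be made concrete with a small calculation. The sketch below assumes each job fails independently with the same probability, which is a simplification. The per-job rates are hypothetical placeholders, not values measured from the Snowflake data, and the function name is illustrative only.

```python
def pipeline_failure_probability(per_job_failure_rate: float, job_count: int) -> float:
    """Probability that at least one of `job_count` jobs fails.

    Assumes jobs fail independently with the same rate (a simplification).
    """
    if not 0.0 <= per_job_failure_rate <= 1.0:
        raise ValueError("per_job_failure_rate must be between 0 and 1")
    if job_count < 0:
        raise ValueError("job_count must be non-negative")
    # P(at least one failure) = 1 - P(every job succeeds)
    return 1.0 - (1.0 - per_job_failure_rate) ** job_count


# Hypothetical per-job transient failure rates, for illustration only.
for rate in (0.001, 0.005, 0.01):
    probability = pipeline_failure_probability(rate, job_count=100)
    print(f"per-job rate {rate:.1%} -> pipeline failure chance {probability:.1%}")
```

Under these assumptions, a per-job transient failure rate of 1% gives a 100-job pipeline roughly a 63% chance of at least one git-related failure, which is why small per-job rates still translate into frequent pipeline failures.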
Impact to external customers:
- Customers experience recurring git issues across multiple projects, creating a perception of systemic unreliability.
- Hidden Impact Beyond Reported Incidents: Most git failures never reach formal incident status because customers silently retry failed jobs, which masks the true scale of the problem.
- Productivity Drain Through Silent Retries: Customers absorb significant productivity losses through repeated job retries, and those retries add a "shadow cost" in the form of increased infrastructure load.
Most Impactful Error States
The resolutions are tracked in the child tasks.
| Error | Tracker | Description | Team to work with |
|---|---|---|---|
| "fatal: couldn't find remote ref" failures | | Failure while cloning a repo with ref pipelineID | Verify |
| "gitaly spawn failed" errors | | Unable to connect to Gitaly | |
| "fatal: fetch-pack: invalid index-pack output" | | Git fails to receive or process repository data | |
| "fatal: the remote end hung up unexpectedly" | | Server abruptly terminates the connection during a Git operation | |
| "GitLab is currently unable to handle this request due to load" | | Too much load while cloning a repo | |
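As one possible way to detect and categorize these error states automatically, the sketch below matches the error strings from the table against a job log. The matched strings come from the table; the category labels, the function name, and the sample log lines are hypothetical, and this is not a description of how the existing triage automation works.

```python
import re
from typing import Optional

# Error strings are taken from the table above; the labels are hypothetical.
GIT_FAILURE_PATTERNS = [
    ("remote_ref_not_found", re.compile(r"fatal: couldn't find remote ref")),
    ("gitaly_spawn_failed", re.compile(r"gitaly spawn failed", re.IGNORECASE)),
    ("invalid_index_pack_output",
     re.compile(r"fatal: fetch-pack: invalid index-pack output")),
    ("remote_hung_up",
     re.compile(r"fatal: the remote end hung up unexpectedly")),
    ("server_overloaded",
     re.compile(r"GitLab is currently unable to handle this request due to load")),
]


def classify_git_failure(job_log: str) -> Optional[str]:
    """Return the label of the first known git error found in the log, else None."""
    for label, pattern in GIT_FAILURE_PATTERNS:
        if pattern.search(job_log):
            return label
    return None


# Hypothetical log excerpts, for illustration only.
print(classify_git_failure("fatal: the remote end hung up unexpectedly"))
print(classify_git_failure("1 example, 1 failure"))
```

A first-match list keeps the sketch simple. A log that contains several of these errors would need a priority order or a multi-label result, which is left out here.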
Edited by David Dieulivol