Gitaly Service Failures in CI Pipelines ("gitaly spawn failed")
## Description
We've identified that 1-2% of infrastructure-related pipeline job failures are due to "gitaly spawn failed" errors.
Because Gitaly handles all Git operations for GitLab, these failures indicate problems with the Gitaly service itself.
## Affected Jobs
Below are some example job IDs with their respective correlation IDs:
| Job ID | Correlation ID |
| ------ | ------ |
| [Job #10063743179](https://gitlab.com/gitlab-org/gitlab/-/jobs/10063743179) | bcb33eb5e5fe9542a7d37909e0f4bd96 |
| [Job #10063743174](https://gitlab.com/gitlab-org/gitlab/-/jobs/10063743174) | 7cbf5fc8085e20071d996a3dfa3051fb |
## Error Logs
```
Spawning Gitaly
Trying to connect to gitaly: ................................................................................................................................................................................................................................................................................................................................................................................................................ FAILED to connect to gitaly
/builds/gitlab-org/gitlab/spec/support/helpers/gitaly_setup.rb:374:in `rescue in spawn_gitaly': gitaly spawn failed (RuntimeError)
log/gitaly-test.log:
time="2025-05-16T23:47:45.512Z" level=info msg="maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined" pid=3812
time="2025-05-16T23:47:45.512Z" level=info msg="grpc prometheus histograms enabled" latencies="[0.001 0.005 0.025 0.1 0.5 1 10 30 60 300 1500]" pid=3812
from /builds/gitlab-org/gitlab/spec/support/helpers/gitaly_setup.rb:346:in `spawn_gitaly'
from scripts/gitaly-test-spawn:20:in `run'
from scripts/gitaly-test-spawn:24:in `<main>'
/builds/gitlab-org/gitlab/spec/support/helpers/gitaly_setup.rb:218:in `try_connect!': could not connect to gitaly (RuntimeError)
from /builds/gitlab-org/gitlab/spec/support/helpers/gitaly_setup.rb:153:in `start'
from /builds/gitlab-org/gitlab/spec/support/helpers/gitaly_setup.rb:126:in `start_gitaly'
from /builds/gitlab-org/gitlab/spec/support/helpers/gitaly_setup.rb:352:in `spawn_gitaly'
from scripts/gitaly-test-spawn:20:in `run'
from scripts/gitaly-test-spawn:24:in `<main>'
PID PPID S %CPU %MEM ELAPSED COMMAND
3812 3811 S 6.4 4.4 00:41 /builds/gitlab-org/gitlab/tmp/tests/gitaly/_build/bin/gitaly /builds/gitlab-org/gitlab/tmp/tests/gitaly/config.toml.transactions
```
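To track how often this failure occurs, job traces could be matched against the error strings in the log above. The following is a minimal sketch only: the constant and helper names are hypothetical and not part of the GitLab codebase, and it assumes the trace is already available as a string.

```ruby
# Error strings copied verbatim from the job trace above.
GITALY_SPAWN_SIGNATURES = [
  'gitaly spawn failed',
  'FAILED to connect to gitaly',
  'could not connect to gitaly'
].freeze

# Hypothetical helper: returns the known signatures present in a job trace.
def gitaly_spawn_failure_signatures(trace)
  GITALY_SPAWN_SIGNATURES.select { |signature| trace.include?(signature) }
end

# Hypothetical helper: true when the trace contains at least one signature.
def gitaly_spawn_failure?(trace)
  gitaly_spawn_failure_signatures(trace).any?
end
```

Matching on several strings rather than only `gitaly spawn failed` is a design choice so that partially truncated traces would still be classified.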
## Preliminary Analysis
We cannot retry the affected jobs because they depend on artifacts that would have to be retrieved first, which makes the error difficult to reproduce. Potential causes include:
- Resource constraints (CPU, memory) on the Gitaly server
- Configuration issues in the Gitaly service
- Network connectivity issues between services
Given the low occurrence rate, this _MIGHT_ be related to specific Gitaly nodes or temporary infrastructure issues rather than a systemic problem.
## Next Steps
- Connect with the Gitaly team
- Review recent Gitaly configuration changes or deployments
- Understand the fallback mechanism for Gitaly failures
- Investigate potential resource constraints during peak usage
- Explore options for more graceful failure handling
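One possible shape for more graceful failure handling is to retry the spawn-and-connect step with exponential backoff before failing the job. The sketch below is an assumption, not the current behavior of `gitaly_setup.rb`: `with_retries` and its parameters are hypothetical, and whether a retry would actually help depends on the root cause, which is still unknown.

```ruby
# Hypothetical helper: runs the block, retrying on RuntimeError with
# exponential backoff (base_delay, 2 * base_delay, 4 * base_delay, ...).
# The sleeper is injectable so the behavior can be exercised without waiting.
def with_retries(max_attempts: 3, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue RuntimeError => e
    # Give up and re-raise the original error once attempts are exhausted.
    raise if attempt >= max_attempts

    delay = base_delay * (2**(attempt - 1))
    warn "Attempt #{attempt} failed (#{e.message}); retrying in #{delay}s"
    sleeper.call(delay)
    retry
  end
end
```

A caller would wrap whatever step spawns and connects to Gitaly, for example `with_retries(max_attempts: 3) { start_and_connect }`, where `start_and_connect` is a placeholder for that step. The last error is re-raised unchanged, so the existing stack trace and log output would still appear when every attempt fails.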