Gitaly Service Failures in CI Pipelines ("gitaly spawn failed")
## Description We've identified that 1-2% of infrastructure-related pipeline job failures are due to "gitaly spawn failed" errors. As Gitaly is used by GitLab's to handles all Git operations, and these failures indicate problems with the Gitaly service itself. ## Affected Jobs Below are some examples of JobID and its respective Correlation ID | Job ID | correlation ID | | ------ | ------ | | [Job #10063743179](https://gitlab.com/gitlab-org/gitlab/-/jobs/10063743179) | bcb33eb5e5fe9542a7d37909e0f4bd96 | | [Job #10063743174](https://gitlab.com/gitlab-org/gitlab/-/jobs/10063743174) | 7cbf5fc8085e20071d996a3dfa3051fb | ## Error Logs ``` Spawning Gitaly 02:40 Trying to connect to gitaly: ................................................................................................................................................................................................................................................................................................................................................................................................................ 
FAILED to connect to gitaly /builds/gitlab-org/gitlab/spec/support/helpers/gitaly_setup.rb:374:in `rescue in spawn_gitaly': gitaly spawn failed (RuntimeError) log/gitaly-test.log: time="2025-05-16T23:47:45.512Z" level=info msg="maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined" pid=3812 time="2025-05-16T23:47:45.512Z" level=info msg="grpc prometheus histograms enabled" latencies="[0.001 0.005 0.025 0.1 0.5 1 10 30 60 300 1500]" pid=3812 from /builds/gitlab-org/gitlab/spec/support/helpers/gitaly_setup.rb:346:in `spawn_gitaly' from scripts/gitaly-test-spawn:20:in `run' from scripts/gitaly-test-spawn:24:in `<main>' /builds/gitlab-org/gitlab/spec/support/helpers/gitaly_setup.rb:218:in `try_connect!': could not connect to gitaly (RuntimeError) from /builds/gitlab-org/gitlab/spec/support/helpers/gitaly_setup.rb:153:in `start' from /builds/gitlab-org/gitlab/spec/support/helpers/gitaly_setup.rb:126:in `start_gitaly' from /builds/gitlab-org/gitlab/spec/support/helpers/gitaly_setup.rb:352:in `spawn_gitaly' from scripts/gitaly-test-spawn:20:in `run' from scripts/gitaly-test-spawn:24:in `<main>' PID PPID S %CPU %MEM ELAPSED COMMAND 3812 3811 S 6.4 4.4 00:41 /builds/gitlab-org/gitlab/tmp/tests/gitaly/_build/bin/gitaly /builds/gitlab-org/gitlab/tmp/tests/gitaly/config.toml.transactions ``` ## Preliminary Analysis We cannot retry the affected jobs (as some artifacts are needed to be retrieved), making it difficult to reproduce the error. Potential causes include: - Resource constraints (CPU, memory) on the Gitaly server - Configuration issues in the Gitaly service - Network connectivity issues between services Given the low occurrence rate, this _MIGHT_ be related to specific Gitaly nodes or temporary infrastructure issues rather than a systemic problem. 
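
Since the affected jobs cannot be retried, one way to keep tracking the occurrence rate is to scan saved job traces for the error signatures shown in the log above. The following is only a minimal sketch: the constant and method names are hypothetical and not part of the GitLab codebase, and it assumes the traces are already available as strings.

```ruby
# Hypothetical helpers for illustration; not part of gitaly_setup.rb.
# The signature strings are copied from the error log in this issue.
GITALY_SPAWN_SIGNATURES = [
  "gitaly spawn failed",
  "could not connect to gitaly"
].freeze

# Returns true when a job trace contains any known spawn-failure signature.
def gitaly_spawn_failure?(trace)
  GITALY_SPAWN_SIGNATURES.any? { |signature| trace.include?(signature) }
end

# Returns the share (0.0 to 1.0) of traces that match a signature.
def gitaly_spawn_failure_rate(traces)
  return 0.0 if traces.empty?

  traces.count { |trace| gitaly_spawn_failure?(trace) }.fdiv(traces.size)
end
```

Running `gitaly_spawn_failure_rate` over the traces of infrastructure-related failures for a given week would give a number to compare against the 1-2% estimate.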
## Next Steps

- Connect with the team handling Gitaly
- Review recent Gitaly configuration changes or deployments
- Understand the fallback mechanism for Gitaly failures
- Investigate potential resource constraints during peak usage
- Explore options for more graceful failure handling
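
As a starting point for the graceful failure handling item, the sketch below shows a generic bounded retry with exponential backoff. It is an illustration under assumptions, not a description of how `gitaly_setup.rb` currently behaves: `with_retries`, its parameters, and the injectable `sleeper` are hypothetical names introduced here.

```ruby
# Hypothetical helper for illustration; not taken from gitaly_setup.rb.
# Runs the block until it succeeds or max_attempts is reached, waiting
# base_delay * 2^(attempt - 1) seconds between attempts.
# The sleeper is injectable so the backoff can be tested without waiting.
def with_retries(max_attempts:, base_delay:, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    # Give up and re-raise the last error once the attempt budget is spent.
    raise if attempt >= max_attempts

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

A caller could wrap the whole spawn step, for example `with_retries(max_attempts: 3, base_delay: 5) { spawn_step.call }`, where `spawn_step` is a placeholder for whatever starts Gitaly and checks the connection. Whether a retry helps depends on the root cause: it can mask transient resource pressure but will not fix a configuration problem.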