Running many CI jobs at once fails with different errors on gitlab.com shared runners
Description of the problem
When many jobs run on shared runners at once, some of them always fail with errors such as
unexpected EOF, unable to connect to the Docker host, exit code 1, no space left on device, or timeouts:
ERROR: Job failed (system failure): Cannot connect to the Docker daemon. Is 'docker daemon' running on this host? (executor_docker.go:1007:0s)
unexpected EOF ERROR: Job failed: exit code 1
no space left on device ERROR: Job failed: exit code 1
ERROR: Job failed: exit code 137
ERROR: Job failed (system failure): error during connect: Get https://10.142.2.117:2376/v1.18/containers/a1bf1f3478b898774bf16c396c48eb1b853e70a2bab0e3afae04c7caf7da5215/json: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "root") (executor_docker.go:965:4s)
ERROR: Job failed (system failure): Error: No such container: 62a4f326396774ac2a6ff4331afa18d4da3ae253944b6d37272d14421edc41e0 (executor_docker.go:965:1s)
Sometimes there is no error in the log, but the job's description says:
There has been a runner system failure, please try again.
The pipeline in question is ok now as I clicked retry, but failed jobs are still accessible here: https://gitlab.com/gableroux/unity3d/pipelines/26396580/builds (visit the jobs page and scroll down until you see failed jobs)
Retrying the failed jobs a couple of times makes them pass, so something probably goes wrong when many jobs run at once. At least that is the case with the free shared runners.
Which Group/Project (with full path) is experiencing the issue?
When does the issue happen?
Every time the CI runs in this project. It started doing this as soon as I started building 75 jobs at the same time or more.
I shouldn't have to manually retry these jobs as they should not fail in the first place.
- I'd prefer a limit on the number of concurrent builds, as long as I know that the jobs that do run won't fail.
- Have an actual fix for most of the above errors
- Have a way to automatically retry jobs n times? That already seems possible according to gitlab-org/gitlab-ce#3442 (closed):
retry: <number>, default is 0. I have not tried that.
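For reference, a minimal sketch of what that could look like in .gitlab-ci.yml. It is untested against this project, and the job names and the ./build.sh script are placeholders, not taken from the actual pipeline. As far as I know, the retry keyword accepts at most 2, and newer GitLab versions also accept a map form that restricts retries to specific failure types, which would fit the runner system failures listed above.

```yaml
# Hypothetical job: the name and script are placeholders.
build-example:
  script:
    - ./build.sh
  # Retry a failed job up to 2 more times (0, the default, disables retries).
  retry: 2

# Hypothetical variant, assuming a GitLab version that supports retry:when.
# Only infrastructure-type failures are retried; script failures are not.
build-example-system-failures-only:
  script:
    - ./build.sh
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
```

This would only hide the failures rather than fix them, so the errors above would still need a proper fix.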
I have seen a few similar issues, but I didn't find one that includes all of the errors I get.
How to reproduce
- Fork https://gitlab.com/gableroux/unity3d/
- Wait for the CI to run (will take a while)
- After ~40 minutes, you should have a few failed jobs
Yeah, just saying thanks: the project in question is actually quite greedy, but it works for free, so thanks GitLab. Feel free to contribute.