CI should retry when runner clone/fetch fails
Summary
The CI-level retry provides retry support for runner-related failures such as runner_system_failure.
However, we occasionally see runner failures related to repo clones or fetches on the gitlab-com/www-gitlab-com repo. Several examples of jobs that failed this way are listed in gitlab-com/www-gitlab-com#7054 (closed).
It should be possible for the runner to capture these failures and retry them when retry is enabled, either as part of the runner_system_failure when value or as a new when value.
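As a rough illustration of the second option, the sketch below shows how a retry decision could match a failure reason against the configured when values. Only runner_system_failure comes from this issue; the get_sources_failure value, the FailureReason type, and the ShouldRetry function are hypothetical names invented for the example and are not taken from the runner or GitLab code.

```go
package main

import "fmt"

// FailureReason is a hypothetical label for why a job failed.
type FailureReason string

const (
	// RunnerSystemFailure is the existing value mentioned in this issue.
	RunnerSystemFailure FailureReason = "runner_system_failure"
	// GetSourcesFailure is a hypothetical new value for clone/fetch failures.
	GetSourcesFailure FailureReason = "get_sources_failure"
)

// ShouldRetry reports whether a failed job qualifies for another attempt:
// retries must remain, and the failure reason must be listed in when.
func ShouldRetry(reason FailureReason, when []FailureReason, retriesUsed, maxRetries int) bool {
	if retriesUsed >= maxRetries {
		return false
	}
	for _, w := range when {
		if w == reason {
			return true
		}
	}
	return false
}

func main() {
	when := []FailureReason{RunnerSystemFailure, GetSourcesFailure}
	fmt.Println(ShouldRetry(GetSourcesFailure, when, 0, 2)) // retries remain
	fmt.Println(ShouldRetry(GetSourcesFailure, when, 2, 2)) // retries exhausted
}
```

With the first option, no new value would be needed: clone/fetch failures would simply be reported under runner_system_failure.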
(note: categorizing this as a 'bug' rather than a 'feature' because it is related to a preventable system failure)
Steps to reproduce
We see this about once a day on the www-gitlab-com master builds: gitlab-com/www-gitlab-com#7054 (closed)
Example Project
Here are some specific job failure examples from the www-gitlab-com master builds tracked in gitlab-com/www-gitlab-com#7054 (closed):
- https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/502245223
- https://gitlab.com/gitlab-com/www-gitlab-com/pipelines/134562035/builds (several)
- https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/508591732
- https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/506984891
It is relevant that the www-gitlab-com repo is very large, over 6 GB. However, the last job failure above is interesting because it is one of several recent failures that occurred on a reinitialize rather than a full clone. This suggests that repo size is not necessarily the cause, because a reinitialize/fetch should be a quick fast-forward that does not pull down much data.
What is the current bug behavior?
Job fails and must be manually rerun/retried.
What is the expected correct behavior?
These failures should be retried automatically.
Relevant logs and/or screenshots
See the job links above under "Example Project".
Possible fixes
It should be possible to implement this retry somewhere in the runner code, e.g. handleGetSourcesStrategy in shells/abstract.go.