CI should retry when runner clone/fetch fails

Summary

The CI-level retry setting supports retrying jobs on runner-related failures such as runner_system_failure.

However, we occasionally see runner failures during repo clones or fetches on the gitlab-com/www-gitlab-com repo. Several examples of jobs that failed this way are listed here: gitlab-com/www-gitlab-com#7054 (closed)

It should be possible for the runner to capture these failures and retry them when retry is enabled, either under the existing runner_system_failure when value or under a new when value.

(note: categorizing this as a 'bug' rather than a 'feature' because it is related to a preventable system failure)

Steps to reproduce

We see this roughly once a day on the www-gitlab-com master builds: gitlab-com/www-gitlab-com#7054 (closed)

Example Project

Specific job failure examples from www-gitlab-com master builds are listed here: gitlab-com/www-gitlab-com#7054 (closed)

It is relevant that the www-gitlab-com repo is very large, over 6 GB. However, the last job failure above is interesting because it is one of several recent failures that occurred on a reinitialize rather than a full clone. This suggests that repo size is not necessarily the cause, because a reinitialize/fetch should be a quick fast-forward that pulls down little data.

What is the current bug behavior?

Job fails and must be manually rerun/retried.

What is the expected correct behavior?

These jobs should be retried automatically when retry is enabled.

Relevant logs and/or screenshots

See job links above under "Example Project"

Possible fixes

It should be possible to implement this retry somewhere in the runner code, e.g. handleGetSourcesStrategy in shells/abstract.go