Retry docker tasks that timeout during job preparation stage and log failed operations
Description
There is a long-running open issue Docker-machine Preparation Failed affecting some customers where docker jobs fail intermittently during the preparation stage. In some cases the cause appears to be related to docker+kernel versions or SELinux, but there have also been reports indicating it can related to the load on the docker host causing docker operations to fail with Cannot connect to the Docker daemon at unix:///var/run/docker.sock
errors. In some cases reducing the concurrency on the host has helped, and often the job will run successfully when retried soon after the initial failure.
This issue suggests making two improvements that while not addressing the root cause would improve the handling of such errors, namely:
- implement a retry mechanism for all docker operations performed by the runner in preparing the job container(s) and associated resources, to allow for docker daemon responsiveness issues caused by transient high CPU and/or IO load; and
- add debug logging to the job preparation stage so that the details of all operations performed and any resulting errors or retries can be observed.
A related suggestion has also been made (ZD internal link) to provide for a custom script to be run on the runner host in the event of job failure during the preparation stage to allow system and performance metrics to be collected and notifications be sent to interested parties, to assist with root cause diagnosis.