WIP: Add retry on docker requests

What does this MR do?

Add a retry mechanism for all docker commands that can return a 404 on sending requests to the Docker API. Use the retry package, so every time a 404 is returned from the API it is returned it is retried with some backoff logic.

Why was this MR needed?

Sometimes when the Docker daemon is underload or there is performance problems it might result in the Runner trying to start/inspect a container that hasn't been created yet, which results in a failure on the job.

Are there points in the code the reviewer needs to double check?

Looking at all the methods that we use and at v1.25 API below are the ones that can return a 404 because the container is not found:

  • ImageInspectWithRaw
  • ContainerStart
  • ContainerKill
  • ContainerInspect
  • ContainerAttach
  • ContainerRemove
  • ContainerLogs
  • ContainerExecCreate
  • ContainerExecAttach
  • NetworkRemove
  • NetworkDisconnect
  • NetworkInspect

Discussion for PoC


Docker will return an objecNotFound, every time a 404 is returned from the API. It doesn't seem like there is a way to make it automatically retry when 404 error is returned in wrapError or anything since each function call is different. To make it reusable we need to wrap each call a run function which will just return an error and then check that error type which is what we are doing in this PoC.

What you should be looking when evaluating the PoC

  • Does it make sense, how we are using the retry mechanism?
  • Is there a better way instead of having us create a struct to implement the retryable interface every time?
  • Do you think it's clear that the user has to look at the struct fields to get the response from the field?
  • Do we all wish Go has generics?
  • Would it be easier to implement a backoff mechanism just for the Docker library?
  • Does it make sense to have context, passed to Run instead of having it as a struct field.
  • Should we implement this for every method we have inside of the client or the ones that only that make sense?

What you should NOT be looking at when evaluating the PoC

  • Code quality
  • Logging
  • Any error checking

This is a simple PoC to get an idea of what we need to do. The

Does this MR meet the acceptance criteria?

  • Documentation created/updated
  • Added tests for this feature/bug
  • In case of conflicts with master - branch was rebased

What are the relevant issue numbers?

reference #4450 (closed)

Edited by Steve Xuereb

Merge request reports