Add attempts to Docker executor for container not found (!1995) · Merge requests · GitLab.org / gitlab-runner

Steve Xuereb requested to merge 4450-retry-stage-on-container-not-found into master Apr 06, 2020

What does this MR do?

Retry stage when container is not found inside of the Docker executor

Why was this MR needed?

When using the Docker executor, if one of the stages fails because of a No Such Container error, retry that specific stage up to 2 more times (3 tries in total). This makes the executor a lot more resilient to issues where we are running a stage and the container gets removed by some other system.
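
As a rough illustration of the retry behavior described above, here is a minimal sketch. It is not the code from this MR: `errNoSuchContainer` and `runWithAttempts` are hypothetical names, and the sentinel error only stands in for whatever error the Docker client actually returns.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSuchContainer is a hypothetical sentinel standing in for the
// Docker "No such container" error; the real executor may detect it differently.
var errNoSuchContainer = errors.New("no such container")

// runWithAttempts runs stage up to attempts times. It retries only when
// the failure is errNoSuchContainer; any other result is returned at once.
func runWithAttempts(attempts int, stage func() error) error {
	if attempts < 1 {
		attempts = 1 // always run the stage at least once
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = stage()
		if err == nil || !errors.Is(err, errNoSuchContainer) {
			return err
		}
	}
	return err
}

func main() {
	calls := 0
	err := runWithAttempts(3, func() error {
		calls++
		if calls < 3 {
			// Simulate the container being removed by another system.
			return fmt.Errorf("stage failed: %w", errNoSuchContainer)
		}
		return nil
	})
	fmt.Println(calls, err) // prints: 3 <nil>
}
```

Only the container-not-found case is retried in this sketch, so a genuine script failure still fails the job on the first attempt.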

The safeBuffer is necessary to prevent more data races inside our code base when we run with go test -race. The races occur because the integration test writes to the job log while also reading from it to trigger specific parts of the test. The integration test turned out to be quite big and not a simple one. There is no way we can easily mock the client from the docker_test package, since the main goal of this package is to be an E2E/integration test.
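
For readers unfamiliar with the pattern, a mutex-guarded buffer could look roughly like the sketch below. This is an assumption about the general shape only; the field and method names are illustrative and the safeBuffer in this MR may differ.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// safeBuffer wraps bytes.Buffer with a mutex so that one goroutine can
// write the job log while another reads it. Illustrative sketch only.
type safeBuffer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

// Write appends p to the buffer while holding the lock.
func (b *safeBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

// String returns the buffered contents while holding the lock.
func (b *safeBuffer) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}

func main() {
	var out safeBuffer
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			fmt.Fprintf(&out, "line %d\n", n)
		}(i)
	}
	wg.Wait()
	// Line order depends on goroutine scheduling.
	fmt.Print(out.String())
}
```

Because every access goes through the same mutex, a test can poll `String()` for a marker in the log while the job is still writing, without go test -race reporting a conflict.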

Testing

Linux/Windows `EXECUTOR_JOB_SECTION_ATTEMPTS` set to 2

config.toml

[[runners]]
  name = "docker"
  url = "http://192.168.144.160:3000"
  token = "xxxx"
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "alpine:3.11"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0

.gitlab-ci.yml

variables:
  SLEEP: 3600

job:
  script:
  - sleep ${SLEEP}

Steps:

Start job with the Runner configured as above and using the .gitlab-ci.yml above.
Wait for job to get to the sleep command
Inside of a terminal window run docker ps

From the docker ps output, find the build container and run docker rm -f $CONTAINER_ID, for example:


$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
6e341b6eb96d        a187dde48cd2        "sh -c 'if [ -x /usr…"   20 seconds ago      Up 20 seconds                                runner-fl5ihr7-project-19-concurrent-0-build-4


$ docker rm -f 6e341b6eb96d
6e341b6eb96d

You should see the build script stage being retried: Linux/Windows
If you want, you can remove the build container one more time, and the job should then fail.

Linux/Windows `EXECUTOR_JOB_SECTION_ATTEMPTS` not set (current behavior on master)

Start job with the Runner configured as above and using the .gitlab-ci.yml above.
Wait for job to get to the sleep command
Inside of a terminal window run docker ps

From the docker ps output, find the build container and run docker rm -f $CONTAINER_ID, for example:


$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
6e341b6eb96d        a187dde48cd2        "sh -c 'if [ -x /usr…"   20 seconds ago      Up 20 seconds                                runner-fl5ihr7-project-19-concurrent-0-build-4


$ docker rm -f 6e341b6eb96d
6e341b6eb96d

The job fails because the container is not found.

Does this MR meet the acceptance criteria?

Documentation created/updated
Added tests for this feature/bug
In case of conflicts with master - branch was rebased

What are the relevant issue numbers?

Reference #4450 (closed)

Edited Apr 09, 2020 by Steve Xuereb
