
Add attempts to Docker executor for container not found

Steve Xuereb requested to merge 4450-retry-stage-on-container-not-found into master

What does this MR do?

Retry stage when container is not found inside of the Docker executor

Why was this MR needed?

When using the Docker executor, if one of the stages fails because of a No Such Container error, retry that specific stage up to 2 more times (3 tries in total). This makes the executor a lot more resilient to issues where we are running a stage and the container gets removed by some other system.
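The retry behavior can be sketched roughly as below. This is a minimal illustration, not the code in this MR: `errNoSuchContainer` and `runStageWithAttempts` are hypothetical names, and the real executor may detect the Docker error differently.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSuchContainer is a hypothetical sentinel standing in for Docker's
// "No such container" error. The real executor may detect it differently.
var errNoSuchContainer = errors.New("no such container")

// runStageWithAttempts (hypothetical helper) runs stage up to attempts times.
// It retries only when the failure wraps errNoSuchContainer; success or any
// other error is returned immediately.
func runStageWithAttempts(attempts int, stage func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 1; i <= attempts; i++ {
		err = stage()
		if err == nil || !errors.Is(err, errNoSuchContainer) {
			return err
		}
		fmt.Printf("attempt %d of %d: container not found\n", i, attempts)
	}
	return err
}

func main() {
	calls := 0
	// Simulate a stage whose container disappears on the first two runs.
	err := runStageWithAttempts(3, func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("exec failed: %w", errNoSuchContainer)
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

Only the "container not found" case is retried in this sketch, so an ordinary script failure still fails the stage on the first attempt.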

The safeBuffer is necessary to prevent more data races inside of our code base when we run with go test -race. This is because we are writing to the job log and reading from the job log to trigger specific parts of the integration test. The integration test turned out to be quite big and not a simple one. There is no way we can easily mock the client from the docker_test package, since the main goal of this package is to be an E2E/integration test.
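For illustration, a mutex-guarded buffer of this kind might look like the sketch below. The field names and method set are assumptions; the safeBuffer added in this MR may differ in detail.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// safeBuffer here is an assumed shape: a bytes.Buffer guarded by a mutex so
// one goroutine can write the job log while another reads it.
type safeBuffer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

// Write appends p to the buffer while holding the lock.
func (b *safeBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

// String returns the buffered contents while holding the lock.
func (b *safeBuffer) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}

func main() {
	var jobLog safeBuffer
	var wg sync.WaitGroup
	// Several goroutines write concurrently, as a job trace might.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Fprintln(&jobLog, "line")
		}()
	}
	wg.Wait()
	fmt.Print(jobLog.String())
}
```

A plain bytes.Buffer is not safe for concurrent use, which is why a test that polls the log while the job is still writing to it would be flagged by the race detector without a wrapper like this.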

Testing

Linux/Windows EXECUTOR_JOB_SECTION_ATTEMPTS set to 2

config.toml
[[runners]]
  name = "docker"
  url = "http://192.168.144.160:3000"
  token = "xxxx"
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "alpine:3.11"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
.gitlab-ci.yml
variables:
  SLEEP: 3600

job:
  script:
  - sleep ${SLEEP}

Steps:

  1. Start job with the Runner configured as above and using the .gitlab-ci.yml above.

  2. Wait for job to get to the sleep command

  3. Inside of a terminal window run docker ps

  4. From the docker ps output, find the build container and run docker rm -f $CONTAINER_ID, for example:

    example
    $ docker ps
    CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
    6e341b6eb96d        a187dde48cd2        "sh -c 'if [ -x /usr…"   20 seconds ago      Up 20 seconds                                runner-fl5ihr7-project-19-concurrent-0-build-4
    
    
    $ docker rm -f 6e341b6eb96d
    6e341b6eb96d
  5. You should see the build script stage being retried: Linux/Windows

  6. If you want, you can remove the build container one more time, and the job should fail.

Linux/Windows EXECUTOR_JOB_SECTION_ATTEMPTS not set (current behavior on master)

  1. Start job with the Runner configured as above and using the .gitlab-ci.yml above.

  2. Wait for job to get to the sleep command

  3. Inside of a terminal window run docker ps

  4. From the docker ps output, find the build container and run docker rm -f $CONTAINER_ID, for example:

    example
    $ docker ps
    CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
    6e341b6eb96d        a187dde48cd2        "sh -c 'if [ -x /usr…"   20 seconds ago      Up 20 seconds                                runner-fl5ihr7-project-19-concurrent-0-build-4
    
    
    $ docker rm -f 6e341b6eb96d
    6e341b6eb96d
  5. The job fails because the container is not found.

Does this MR meet the acceptance criteria?

  • Documentation created/updated
  • Added tests for this feature/bug
  • In case of conflicts with master - branch was rebased

What are the relevant issue numbers?

Reference #4450 (closed)

Edited by Steve Xuereb
