Docker+autoscaler: Properly clean up when a job times out or is cancelled

When using the docker+autoscaler executor, the runner connects to the remote daemon instead of the local one. That connection uses the main job context. When a job times out or is cancelled, that context expires or is cancelled, so the cleanup commands can no longer be sent to the remote daemon. This is the cause of Runner fails to clean up containers after job t... (#38725 - closed), https://gitlab.com/gitlab-com/request-for-help/-/issues/2434+, and possibly other bugs.

For job cleanup to work in this case, we need to create a new docker.Client in the docker executor's Cleanup method, with a new connection to the remote docker daemon, using a different context (one that is still valid). We also need to make the network and volume managers use this new docker.Client.

I've added a test case to the autoscaler integration tests that covers all of this except volume deletion. Unfortunately, there isn't an API in our official docker client to list volumes (to ensure they are also deleted), and I couldn't justify adding a production API to be used only in one test. I did test this manually, though, to make sure it works.

Manual Testing

I ran the integration tests and simultaneously ran docker ps -a;echo;docker volume ls;echo; docker network ls repeatedly:

> docker ps -a;echo;docker volume ls;echo; docker network ls
CONTAINER ID   IMAGE                           COMMAND                  CREATED        STATUS        PORTS     NAMES

DRIVER    VOLUME NAME

NETWORK ID     NAME      DRIVER    SCOPE
43eaa6dc9c89   bridge    bridge    local
2e2f096f1618   host      host      local
090ffafb0c79   none      null      local

...

> docker ps -a;echo;docker volume ls;echo; docker network ls
CONTAINER ID   IMAGE                           COMMAND                  CREATED         STATUS                    PORTS     NAMES
a2b503f13e3e   14119a10abf4                    "sh -c 'if [ -x /usr…"   1 second ago    Up 1 second                         runner-runner-t-project-0-concurrent-24627534-8e8202b0a9cd78da-build
68098e66eeb8   9a7f10a5b8ce                    "/usr/bin/dumb-init …"   3 seconds ago   Exited (0) 1 second ago             runner-runner-t-project-0-concurrent-24627534-8e8202b0a9cd78da-predefined

DRIVER    VOLUME NAME
local     runner-runner-t-project-0-concurrent-24627534-8e8202b0a9cd78da-cache-c81ae32c384bedc418fc4069dcbf376b

NETWORK ID     NAME                           DRIVER    SCOPE
43eaa6dc9c89   bridge                         bridge    local
2e2f096f1618   host                           host      local
090ffafb0c79   none                           null      local
3dde2a0eb14a   runner-runner-t-0-24627534-0   bridge    local

...

> docker ps -a;echo;docker volume ls;echo; docker network ls
CONTAINER ID   IMAGE                           COMMAND                  CREATED        STATUS        PORTS     NAMES

DRIVER    VOLUME NAME

NETWORK ID     NAME      DRIVER    SCOPE
43eaa6dc9c89   bridge    bridge    local
2e2f096f1618   host      host      local
090ffafb0c79   none      null      local

Another way to test this is to run a job that times out, and check the job log trace to make sure there are no container, volume, or network removal errors (like so).

Runner config

[[runners]]
  name = "static runner"
  url = "https://gitlab.com"
  token = "XXXXXXXXX"
  executor = "docker-autoscaler"

  [runners.docker]
    tls_verify = false

  [runners.autoscaler]
    plugin = "fleeting-plugin-static"

    [runners.autoscaler.connector_config]
      use_static_credentials = true
      username = "XXX"
      key_path = "XXX"
      timeout = "1h"
      use_external_addr = true

    [runners.autoscaler.plugin_config]
      name = "local"
    [runners.autoscaler.plugin_config.instances.staticlinux]
      external_addr = "127.0.0.1"

CI job

stages:
  - test

test:
  stage: test
  image: alpine:latest
  timeout: 15s
  script:
    - sleep 900

Best reviewed commit-at-a-time.

Edited by Axel von Bertoldi
