Gitlab Runner delete container before job finished
Summary
After upgraded to Gitlab 12.8.8 self hosted, we start observe a lot of jobs started to fail randomly with error
Cannot connect to the Docker daemon at unix:///var/run/docker.sock
before or after user script is ran.
We tried to turn on debug log for both Docker Daemon and Gitlab-runner and the sample log is as below.
Steps to reproduce
Actual gitlab-ci.yml does not matter, job could even fail with echo 1
script and failed instance wide randomly
Actual behavior
Job failed at either beginning or at the end of user script
Expected behavior
Job success
Relevant logs and/or screenshots
Systemd log
Apr 9 16:58:16 gitlabrunner-1016.xxxx.xxxx.xxxxxxx.com dockerd[1167]: time="2020-04-09T16:58:16.833203425+02:00" level=debug msg="Calling GET /v1.25/images/sha256:02f5a4635f2c5842cd3161eb28fd1d0201b5925004bf42c000c716b461c3847c/json"
Apr 9 16:58:16 gitlabrunner-1016.xxxx.xxxx.xxxxxxx.com dockerd[1167]: time="2020-04-09T16:58:16.833982321+02:00" level=debug msg="Calling GET /v1.25/networks"
Apr 9 16:58:16 gitlabrunner-1016.xxxx.xxxx.xxxxxxx.com dockerd[1167]: time="2020-04-09T16:58:16.839959720+02:00" level=debug msg="Calling DELETE /v1.25/containers/runner-tpJMREDe-project-177-concurrent-6-predefined-2?force=1&v=1"
Apr 9 16:58:16 gitlabrunner-1016.xxxx.xxxx.xxxxxxx.com gitlab-runner[14915]: ^[[37;1mRemoved container runner-tpJMREDe-project-177-concurrent-6-predefined-2 with Error: No such container: runner-tpJMREDe-project-177-concurrent-6-predefined-2 (executor_docker.go:861:0s)^[[0;m ^[[37;1mjob^[[0;m=51415529 ^[[37;1mproject^[[0;m=177 ^[[37;1mrunner^[[0;m=tpJMREDe
Apr 9 16:58:16 gitlabrunner-1016.xxxx.xxxx.xxxxxxx.com gitlab-runner[14915]: ^[[37;1mCreating container runner-tpJMREDe-project-177-concurrent-6-predefined-2 ...^[[0;m ^[[37;1mjob^[[0;m=51415529 ^[[37;1mproject^[[0;m=177 ^[[37;1mrunner^[[0;m=tpJMREDe
Apr 9 16:58:16 gitlabrunner-1016.xxxx.xxxx.xxxxxxx.com dockerd[1167]: time="2020-04-09T16:58:16.840365797+02:00" level=debug msg="Calling POST /v1.25/containers/create?name=runner-tpJMREDe-project-177-concurrent-6-predefined-2"
Apr 9 16:58:16 gitlabrunner-1016.xxxx.xxxx.xxxxxxx.com gitlab-runner[14915]: ^[[37;1mSkipping referees execution ^[[0;m ^[[37;1mjob^[[0;m=51415529 ^[[37;1mproject^[[0;m=177 ^[[37;1mrunner^[[0;m=tpJMREDe
Apr 9 16:58:16 gitlabrunner-1016.xxxx.xxxx.xxxxxxx.com gitlab-runner[14915]: ^[[31;1mERROR: Job failed (system failure): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (executor_docker.go:716:120s)^[[0;m ^[[31;1mduration^[[0;m=2m47.057010563s ^[[31;1mjob^[[0;m=51415529 ^[[31;1mproject^[[0;m=177 ^[[31;1mrunner^[[0;m=tpJMREDe
Apr 9 16:58:16 gitlabrunner-1016.xxxx.xxxx.xxxxxxx.com gitlab-runner[14915]: ^[[31;1mERROR: Job failed (system failure): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (executor_docker.go:716:120s)^[[0;m ^[[31;1mduration^[[0;m=2m47.057010563s ^[[31;1mjob^[[0;m=51415529 ^[[31;1mproject^[[0;m=177 ^[[31;1mrunner^[[0;m=tpJMREDe
Here we see that docker daemon received a force DELETE on a container that is later seek by gitlab-runner. There is no other process using Docker on this host, its safe to assume that gitlab-runner was the one who called for the DELETE. Its a bit puzzling to me why gitlab-runner DELETE a container that it still need? Is this a race condition somewhere?
job log
[0;msection_start:1586440210:prepare_executor
[0K[0KUsing Docker executor with image golang:1.13-alpine ...
[0;m[0KPulling docker image golang:1.13-alpine ...
[0;m[0KUsing docker image sha256:2e384b27f926110890d5b7a82937b43ca8b27a1a2c389bef060882244b97f1aa for golang:1.13-alpine ...
[0;msection_end:1586440260:prepare_executor
[0Ksection_start:1586440260:prepare_script
[0Ksection_end:1586440380:prepare_script
[0Ksection_start:1586440380:upload_artifacts_on_failure
[0Ksection_end:1586440380:upload_artifacts_on_failure
[0K[31;1mERROR: Job failed (system failure): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (executor_docker.go:716:120s)
[0;m
Environment description
This is a self-hosted runner on baremetal with docker executor
config.toml contents
concurrent = 12
check_interval = 0
log_level = 'debug'
[session_server]
session_timeout = 1800
[[runners]]
name = "gitlabrunner-1016.xxxx.xxxx.xxxxxxx.com / shared / docker"
output_limit = 8192
url = "https://gitlab.xxxxxxx.com/ci"
token = "xxxx"
executor = "docker"
pre_clone_script = "export http_proxy=http://webproxy.xxxx.xxxx.xxxxxxx.com:3128/; export https_proxy=$http_proxy; export no_proxy=.xxxxxxx.com"
pre_build_script = "export http_proxy=http://webproxy.xxxx.xxxx.xxxxxxx.com:3128/; export https_proxy=$http_proxy; export no_proxy=.xxxxxxx.com"
[runners.custom_build_dir]
[runners.docker]
tls_verify = false
image = "centos:7"
privileged = false
disable_entrypoint_overwrite = false
oom_kill_disable = false
devices = ["/dev/kvm"]
disable_cache = false
volumes = ["/cache"]
cpus = "4"
shm_size = 0
[runners.cache]
Type = "s3"
Path = "gitlab-ci-cache"
Shared = true
[runners.cache.s3]
ServerAddress = "s3-main-01.xxxxxxx.com"
AccessKey = "xxxxxxxxxx"
SecretKey = "xxxxxxxxxxxx"
BucketName = "gitlab-ci-cache"
Used GitLab Runner version
Tested with both 12.8 and 12.7.1
Version: 12.7.1
Git revision: 003fe500
Git branch: 12-7-stable
GO version: go1.13.5
Built: 2020-01-23T09:08:54+0000
OS/Arch: linux/amd64
Version: 12.8.0
Git revision: 1b659122
Git branch: 12-8-stable
GO version: go1.13.7
Built: 2020-02-22T03:36:44+0000
OS/Arch: linux/amd64