CI job container waits/sleeps and so blocks further project jobs
Summary
A developer starts a pipeline and cancels its jobs immediately afterwards, while they are still in the pending state. When a pipeline is started for the same project some time later, the CI job fails with an error saying that a container for the job already exists and another one can't be started.
ContainerCreate: Error response from daemon: Conflict. The container name "/runner-9c8a6766-project-254-concurrent-0-docker-0-wait-for-service" is already in use by container "4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a". You have to remove (or rename) that container to be able to reuse that name. (executor_docker.go:1257:0s)
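To locate the blocking container quickly, the ID can be pulled out of the error text. The following is only a minimal sketch: the function name conflict_container_id is hypothetical, and it assumes the message keeps the wording shown above and that a grep with -o and -E support (GNU or BSD) is available.

```shell
# Hypothetical helper: print the 64-character container ID named in a Docker
# "Conflict" error message. The message is passed as the first argument.
# Prints nothing if the message does not contain the expected wording.
conflict_container_id() {
  printf '%s\n' "$1" \
    | grep -oE 'by container "[0-9a-f]{64}"' \
    | grep -oE '[0-9a-f]{64}'
}

# Possible use on the runner host (commented out; needs a Docker daemon):
# docker inspect "$(conflict_container_id "$error_message")"
```

The two-step grep first isolates the `by container "..."` fragment so that hex strings elsewhere in the message cannot be picked up by mistake.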
On the gitlab-runner host this exact container still exists, usually after having run for many hours, and is in the Docker state running.
4954b5ad5f40 6aec5f3284ac "gitlab-runner-helpe…" Up 2 days runner-9c8a6766-project-254-concurrent-0-docker-0-wait-for-service
docker logs doesn't show any output for this container. docker inspect shows the following.
Docker inspect:
---------------
docker inspect 4954b5ad5f40
[
{
"Id": "4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a",
"Created": "2019-06-04T11:22:59.562724862Z",
"Path": "gitlab-runner-helper",
"Args": [
"health-check"
],
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 25621,
"ExitCode": 0,
"Error": "",
"StartedAt": "2019-06-04T11:23:00.400500729Z",
"FinishedAt": "0001-01-01T00:00:00Z"
},
A closer look at the process shows that it is in the sleeping state.
Linux ps on pid:
----------------
ps -ef | grep 4954b5ad5f40
root 3610 12156 0 11:37 pts/2 00:00:00 grep --color 4954b5ad5f40
root 25581 32482 0 Jun04 ? 00:00:15 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
ps -ef | grep 25581
root 3626 12156 0 11:37 pts/2 00:00:00 grep --color 25581
root 25581 32482 0 Jun04 ? 00:00:15 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
root 25621 25581 0 Jun04 ? 00:02:02 gitlab-runner-helper health-check
ps -ef | grep 25621
root 25621 25581 0 Jun04 ? 00:02:01 gitlab-runner-helper health-check
root 28463 28531 0 10:43 pts/3 00:00:00 grep --color 25621
Get proc status:
----------------
cat /proc/25581/status
Name: containerd-shim
State: S (sleeping)
cat /proc/25621/status
Name: gitlab-runner-h
State: S (sleeping)
Strace on pid:
--------------
strace -p 25581
strace: Process 25581 attached
futex(0x8bd2e8, FUTEX_WAIT, 0, NULL
strace -p 25621
strace: Process 25621 attached
futex(0x11651d0, FUTEX_WAIT, 0, NULL
Lsof on pid:
------------
lsof -p 25581
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
container 25581 root cwd DIR 0,18 120 703 /run/containerd/io.containerd.runtime.v1.linux/moby/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a
container 25581 root rtd DIR 252,0 4096 2 /
container 25581 root txt REG 252,0 4961544 3870 /usr/bin/containerd-shim
container 25581 root 0r CHR 1,3 0t0 6 /dev/null
container 25581 root 1w CHR 1,3 0t0 6 /dev/null
container 25581 root 2w CHR 1,3 0t0 6 /dev/null
container 25581 root 4u a_inode 0,11 0 8131 [eventpoll]
container 25581 root 5u a_inode 0,11 0 8131 [eventpoll]
container 25581 root 6u unix 0xffff880630efa800 0t0 237815667 @/containerd-shim/moby/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a/shim.sock type=STREAM
container 25581 root 7u unix 0xffff880630ef9000 0t0 237826276 @/containerd-shim/moby/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a/shim.sock type=STREAM
container 25581 root 8r FIFO 0,10 0t0 237826278 pipe
container 25581 root 9u FIFO 0,18 0t0 701 /run/docker/containerd/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a/init-stdout
container 25581 root 10r FIFO 0,10 0t0 237826279 pipe
container 25581 root 11w FIFO 0,18 0t0 701 /run/docker/containerd/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a/init-stdout
container 25581 root 12u FIFO 0,18 0t0 701 /run/docker/containerd/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a/init-stdout
container 25581 root 13r FIFO 0,18 0t0 701 /run/docker/containerd/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a/init-stdout
container 25581 root 14u FIFO 0,18 0t0 702 /run/docker/containerd/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a/init-stderr
container 25581 root 15w FIFO 0,18 0t0 702 /run/docker/containerd/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a/init-stderr
container 25581 root 16u FIFO 0,18 0t0 702 /run/docker/containerd/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a/init-stderr
container 25581 root 17r FIFO 0,18 0t0 702 /run/docker/containerd/4954b5ad5f402c35ec8fd86f8ddddd309c22d7daf9ad060fe1298d23bec8ae3a/init-stderr
lsof -p 25621
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
gitlab-ru 25621 root cwd DIR 0,81 4096 4561005 /
gitlab-ru 25621 root rtd DIR 0,81 4096 4561005 /
gitlab-ru 25621 root txt REG 252,1 14027488 5484652 /usr/bin/gitlab-runner-helper
gitlab-ru 25621 root 0u CHR 1,3 0t0 6 /dev/null
gitlab-ru 25621 root 1w FIFO 0,10 0t0 237826278 pipe
gitlab-ru 25621 root 2w FIFO 0,10 0t0 237826279 pipe
gitlab-ru 25621 root 3u sock 0,8 0t0 258537451 protocol: TCP
gitlab-ru 25621 root 4u a_inode 0,11 0 8131 [eventpoll]
The gitlab-runner logs contain WARNING entries reporting 403 Forbidden, followed by an ERROR, from the time the developer canceled the pending job.
WARNING: Appending trace to coordinator... aborted code=403 job=397418 job-log= job-status=canceled runner=9c8a6766 sent-log=5816-6245 status=403 Forbidden
WARNING: Job failed: canceled duration=1m6.427300572s job=397418 project=349 runner=9c8a6766
WARNING: Submitting job to coordinator... aborted code=403 job=397418 job-status=canceled runner=9c8a6766
ERROR: Failed to process runner builds=1 error=canceled executor=docker runner=9c8a6766
The GitLab Nginx logs show the same 403 responses on the server side.
10.1.1.1 - - [05/Jun/2019:09:13:47 +0200] "PATCH /api/v4/jobs/397418/trace HTTP/1.1" 403 49 "" "gitlab-runner 11.11.1 (11-11-stable; go1.8.7; linux/amd64)"
10.1.1.1 - - [05/Jun/2019:09:13:47 +0200] "PATCH /api/v4/jobs/397418/trace HTTP/1.1" 403 49 "" "gitlab-runner 11.11.1 (11-11-stable; go1.8.7; linux/amd64)"
10.1.1.1 - - [05/Jun/2019:09:13:47 +0200] "PUT /api/v4/jobs/397418 HTTP/1.1" 403 49 "" "gitlab-runner 11.11.1 (11-11-stable; go1.8.7; linux/amd64)"
Steps to reproduce
The described problem can be easily reproduced by canceling a pending job.
Example Project
What is the current bug behavior?
Canceling a pending job doesn't stop the CI process correctly: a hanging container remains until it is manually stopped or killed.
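Until this is fixed, the manual cleanup can be scripted. This is a rough sketch and not an official gitlab-runner procedure: the function name stale_wait_containers is hypothetical, and it assumes that leftover containers can be recognized by the -wait-for-service name suffix seen in the error message.

```shell
# Hypothetical helper: read "ID NAME" pairs on stdin (one container per line)
# and print the IDs of containers whose name ends in "-wait-for-service".
stale_wait_containers() {
  awk '$2 ~ /-wait-for-service$/ { print $1 }'
}

# Possible use on the runner host (commented out; needs a Docker daemon and
# GNU xargs for -r). Check the list before removing anything, because a
# wait-for-service container may also belong to a job that is still starting.
# docker ps --format '{{.ID}} {{.Names}}' | stale_wait_containers
# docker ps --format '{{.ID}} {{.Names}}' | stale_wait_containers | xargs -r docker rm -f
```

Keeping the filter separate from the docker calls makes it possible to review the matched IDs first and only then pass them to docker rm -f.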
What is the expected correct behavior?
As in all previous versions, this problem shouldn't occur. The gitlab-runner should take care of such containers and stop them when needed.
Relevant logs and/or screenshots
See the summary
Output of checks
Results of GitLab environment info
System: Debian 9.9
Current User: git
Using RVM: no
Ruby Version: 2.5.3p105
Gem Version: 2.7.6
Bundler Version: 1.17.3
Rake Version: 12.3.2
Redis Version: 3.2.12
Git Version: 2.18.1
Sidekiq Version: 5.2.5
Go Version: unknown

GitLab information
Version: 11.10.4
Revision: 62c464651d2
Directory: /opt/gitlab/embedded/service/gitlab-rails
DB Adapter: PostgreSQL
DB Version: 9.6.11
URL: https://git.example.com
HTTP Clone URL: https://git.example.com/some-group/some-project.git
SSH Clone URL: git@git.example.com:some-group/some-project.git
Using LDAP: yes
Using Omniauth: yes
Omniauth Providers: saml

GitLab Shell
Version: 9.0.0
Repository storage paths:
- default: /var/opt/gitlab/git-data/repositories
GitLab Shell path: /opt/gitlab/embedded/service/gitlab-shell
Git: /opt/gitlab/embedded/bin/git
Results of GitLab application Check
Checking GitLab subtasks ...
Checking GitLab Shell ...
GitLab Shell: ...
GitLab Shell version >= 9.0.0 ? ... OK (9.0.0)
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Check GitLab API access: OK
Redis available via internal API: OK
Access to /var/opt/gitlab/.ssh/authorized_keys: OK
gitlab-shell self-check successful
Checking GitLab Shell ... Finished
Checking Gitaly ...
Gitaly: ... default ... OK
Checking Gitaly ... Finished
Checking Sidekiq ...
Sidekiq: ...
Running? ... yes
Number of Sidekiq processes ... 1
Checking Sidekiq ... Finished
Checking Incoming Email ...
Incoming Email: ... Reply by email is disabled in config/gitlab.yml
Checking Incoming Email ... Finished
Checking LDAP ...
LDAP: ...
Server: ldapmain
LDAP authentication... Anonymous. No bind_dn or password configured
LDAP users with access to your GitLab server (only showing the first 100 results)
DN: here follow many users, won't post them here
Checking LDAP ... Finished
Checking GitLab App ...
Git configured correctly? ... yes
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... yes
Init script exists? ... skipped (omnibus-gitlab has no init script)
Init script up-to-date? ... skipped (omnibus-gitlab has no init script)
Projects have namespace: ...
here follow the project id's, won't post them here
Redis version >= 2.8.0? ... yes
Ruby version >= 2.5.3 ? ... yes (2.5.3)
Git version >= 2.18.0 ? ... yes (2.18.1)
Git user has default SSH configuration? ... yes
Active users: ... 336
Checking GitLab App ... Finished
Checking GitLab subtasks ... Finished