Gitaly affects the stability of tests
I looked at this job: https://gitlab.com/gitlab-org/gitlab-ce/-/jobs/31576391.
It has a lot of failures like this:
E0906 21:18:13.620406012 126 completion_queue.c:667] Kick failed: {"created":"@1504732693.620286942","description":"Kick Failure","file":"src/core/lib/iomgr/ev_epollsig_linux.c","file_line":275,"referenced_errors":[{"created":"@1504732693.620283725","description":"OS Error","errno":11,"file":"src/core/lib/iomgr/ev_epollsig_linux.c","file_line":1131,"os_error":"Resource temporarily unavailable","syscall":"pthread_kill"}]}
I logged in to the machine named at the top of the job log:
Using docker image dev.gitlab.org:5005/gitlab/gitlab-build-images:ruby-2.3.3-golang-1.8-git-2.13-phantomjs-2.1-node-7.1-postgresql-9.6 ID=sha256:b4413b57acb65fde549df3925f87404f0cc2cb7f5caa80cd6fa7ba119c2cd3ae for build container...
Running on runner-30d62d59-project-13083-concurrent-0 via runner-30d62d59-auto-scale-1504709074-a6d4bc0a...
Fetching changes for 37158-autodevops-banner with git depth set to 20...
ssh kamil@private-runners-manager-1.gitlab.com
docker-machine ssh runner-30d62d59-auto-scale-1504709074-a6d4bc0a
And I saw:
root 1333 0.0 0.1 498324 2148 ? Ssl 14:45 0:06 /usr/bin/containerd --listen unix:///var/run/docker/libcontainerd/docker-containerd.sock --shim /usr/bin/containerd-shim --sta
root 1464 0.0 0.0 216056 0 ? Sl 14:45 0:00 \_ /usr/bin/containerd-shim 68d0427d4c0c27541861e0691d5902643955a95cdc4d9353c300459e5806583f /var/run/docker/libcontainerd/68
root 1518 0.0 0.0 43332 0 ? Ssl 14:45 0:00 | \_ /bin/node_exporter -collector.procfs /host/proc -collector.sysfs /host/sys -collector.filesystem.ignored-mount-points
root 1465 0.0 0.0 216056 0 ? Sl 14:45 0:00 \_ /usr/bin/containerd-shim 95941524738ceb5b406b5a25bf1f7016b1aaa035bf9dfe3441be5f9571ea0737 /var/run/docker/libcontainerd/95
root 1517 0.0 0.0 1524 0 ? Ss 14:45 0:00 | \_ /sbin/init
root 1623 0.0 0.0 65908 332 ? Ssl 14:45 0:01 | \_ /usr/sbin/rsyslogd -i /var/run/rsyslogd.pid -f /etc/rsyslog.conf
root 1652 1.0 4.5 320532 92788 ? Ssl 14:45 4:04 | \_ /usr/bin/suricata --pidfile /var/run/suricata/suricata.pid -D --af-packet -c /etc/suricata/suricata.yaml
root 1467 0.0 0.0 215000 0 ? Sl 14:45 0:00 \_ /usr/bin/containerd-shim 6bed82cfe5227c6f5e7c5728b08ca8198dc84799cf23bf19c74b9b00fb7d5777 /var/run/docker/libcontainerd/6b
root 1510 3.0 1.2 303784 25532 ? Ssl 14:45 11:54 | \_ /usr/bin/cadvisor -logtostderr --port=9099
root 32413 0.0 0.0 207860 0 ? Sl 20:37 0:00 \_ /usr/bin/containerd-shim b2e18128ba754058686f1d30e50bf312eee6e72835ff081886d339ec3474db7b /var/run/docker/libcontainerd/b2
999 32428 4.9 9.0 1532888 186252 ? Ssl 20:37 1:58 | \_ mysqld
root 32468 0.0 0.0 151928 0 ? Sl 20:37 0:00 \_ /usr/bin/containerd-shim 22084228bca5e177378979ba8cbcc56daac2482fcbfb8853cbb40a228d5cd622 /var/run/docker/libcontainerd/22
100 32488 0.2 0.1 21904 2056 ? Ssl 20:37 0:04 | \_ redis-server
root 659 0.0 0.0 150520 0 ? Sl 20:40 0:01 \_ /usr/bin/containerd-shim 50391e068e52663e6c1af5a7ccfc2b2c09b15ff6b4b2933f49cf387b5e412358 /var/run/docker/libcontainerd/50
root 677 0.0 0.0 22024 84 ? Ss 20:40 0:00 \_ /bin/bash
root 698 0.0 0.0 22264 112 ? S 20:40 0:00 \_ /bin/bash
root 758 0.0 0.0 68420 464 ? Sl 20:42 0:00 | \_ /usr/local/bin/ruby /usr/local/bundle/bin/knapsack rspec --color --format documentation
root 769 0.0 0.0 4360 0 ? S 20:42 0:00 | \_ sh -c bundle exec rspec --color --format documentation --default-path spec -- "spec/requests/api/merge_req
root 770 48.2 57.2 2308548 1174340 ? Sl 20:42 17:10 | \_ /builds/gitlab-org/gitlab-ce/vendor/ruby/2.3.0/bin/rspec --color --format documentation --default-path
root 9642 11.3 6.9 1859284 142932 ? Sl 20:52 2:51 | \_ /usr/bin/phantomjs --load-images=yes --ignore-ssl-errors=yes --ssl-protocol=TLSv1 /builds/gitlab-o
root 757 3.6 1.1 216192 23280 ? Sl 20:42 1:18 \_ tmp/tests/gitaly/gitaly tmp/tests/gitaly/config.toml
root 772 0.4 0.2 1334436 4744 ? Sl 20:42 0:08 \_ bin/gitaly-ruby 56 /tmp/gitaly-ruby822767571/socket
root 834 0.0 0.0 0 0 ? Z 20:43 0:00 \_ [git] <defunct>
root 835 0.0 0.0 0 0 ? Z 20:43 0:00 \_ [git] <defunct>
root 836 0.0 0.0 0 0 ? Z 20:43 0:00 \_ [git] <defunct>
root 837 0.0 0.0 0 0 ? Z 20:43 0:00 \_ [git] <defunct>
root 838 0.0 0.0 0 0 ? Z 20:43 0:00 \_ [git] <defunct>
root 840 0.0 0.0 0 0 ? Z 20:43 0:00 \_ [git] <defunct>
root 841 0.0 0.0 0 0 ? Z 20:43 0:00 \_ [git] <defunct>
root 842 0.0 0.0 0 0 ? Z 20:43 0:00 \_ [git] <defunct>
root 845 0.0 0.0 0 0 ? Z 20:43 0:00 \_ [git] <defunct>
root 847 0.0 0.0 0 0 ? Z 20:43 0:00 \_ [git] <defunct>
root 849 0.0 0.0 0 0 ? Z 20:43 0:00 \_ [git] <defunct>
...
To be exact:
$ ps auxf | grep defunct | wc -l
8823
It simply means that the git processes do exit, but their parent never waits on them, so the process handles are never released. This eats all our resources, and we end up with a CI job that is stuck until it times out, because the zombies prevent the job from finishing.
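To illustrate the point about exited processes not being cleaned up, here is a minimal sketch in Go of a parent that always reaps its child. This is not Gitaly's actual code, and I have not confirmed which component spawns the leaked `git` processes. `runAndReap` is a hypothetical helper and `true` is just a stand-in command.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runAndReap starts a command and always waits for it, so the kernel can
// release the child's process-table entry once it has exited.
func runAndReap(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start %s: %w", name, err)
	}
	// Without this Wait, the exited child would stay in state Z (<defunct>)
	// for as long as this parent process keeps running.
	if err := cmd.Wait(); err != nil {
		return fmt.Errorf("wait %s: %w", name, err)
	}
	return nil
}

func main() {
	// "true" is a stand-in for whatever command the parent spawns.
	if err := runAndReap("true"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("child exited and was reaped")
}
```

The key design point is that every `Start` is paired with a `Wait`, including on error paths. A code path that returns early after `Start` without calling `Wait` would leave one zombie per call.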
cc @andrewn