Some CI jobs abruptly stop with "Job Suceeded" status
I have some CI jobs that take a while (approximately 30 minutes) and generate relatively long traces (around 7 MB). I noticed recently that some of the jobs weren't running to completion, but all of them still generated a "passed" status in the GitLab UI. I'm using a docker-machine
runner that provisions runners on Google Compute Engine.
I noticed that at a particular spot in the CI job (not always the exact same place in the output, but in the same vicinity), the output would be cut off and the job would immediately stop. In these cases, the output is even cut off mid-line, so I'm very certain that the CI job itself isn't finishing. Immediately after the line is cut off, the trace in the Web UI says Job Succeeded
(without a line break beforehand).
In an attempt to debug the problem, I executed the runner in debug mode (gitlab-runner --debug run
). The output from the runner is mostly repetitive for the duration of the job, but here is a snippet right near the end where I saw the problem:
Appending trace to coordinator... ok code=202 job=16557 job-log=0-6974356 job-status=running runner=bd061a1d sent-log=6970090-6974356 status=202 Accepted
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-04 23:59:44.434303522 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Appending trace to coordinator... ok code=202 job=16557 job-log=0-6982548 job-status=running runner=bd061a1d sent-log=6974356-6982548 status=202 Accepted
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-04 23:59:47.434548734 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Appending trace to coordinator... ok code=202 job=16557 job-log=0-6986790 job-status=running runner=bd061a1d sent-log=6982548-6986790 status=202 Accepted
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-04 23:59:50.434799088 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Appending trace to coordinator... ok code=202 job=16557 job-log=0-6990886 job-status=running runner=bd061a1d sent-log=6986790-6990886 status=202 Accepted
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-04 23:59:53.435034392 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Appending trace to coordinator... ok code=202 job=16557 job-log=0-6994982 job-status=running runner=bd061a1d sent-log=6990886-6994982 status=202 Accepted
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-04 23:59:56.435331798 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-04 23:59:59.435567851 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:02.437958597 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:05.438747372 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:08.43855231 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:11.438853409 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:14.439186216 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:17.439407017 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:20.439674989 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:23.439920625 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Appending trace to coordinator... ok code=202 job=16557 job-log=0-6994982 job-status=running runner=bd061a1d sent-log=6994982-6994982 status=202 Accepted
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:26.440211956 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:29.440464757 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:32.44073548 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:35.441095243 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:38.441385347 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:41.443207349 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Feeding runners to channel builds=1
Docker Machine Details creating=0 idle=0 maxMachines=20 minIdleCount=0 removing=0 runner=bd061a1d time=2018-03-05 00:00:44.443372011 -0500 EST total=1 used=1
Checking for jobs... nothing runner=bd061a1d
Killing container 54378fe42fd930c06290f41537c737ffc0da1453f39c69f51541f6abf14a4fe5 ... job=16557 project=27 runner=bd061a1d
Looking for prebuilt image gitlab/gitlab-runner-helper:x86_64-80b03db9 ... job=16557 project=27 runner=bd061a1d
The part at the very end caught my eye; it says that the runner is killing the container for some reason, but I don't see anything that would explain why. I glanced through the source code for gitlab-ci-multi-runner
, but I couldn't make much sense of it since I'm not a Go developer. It appears that the only reason why a container should be killed is the cancellation of the CI job, but that isn't happening; in fact, the Web UI indicates that the job is in a "passed" state. The job isn't timing out, either; my project has a timeout of 120 minutes set, and these problems are occurring in the approximately 30 minute range.