Stuck_CI_Builds_Worker marks builds running for more than 2 days as failed
The description of this issue has been updated after further investigation was done.
Summary
-
ci_builds.updated_at
does not get updated while builds are running. -
If the
stuck_ci_builds_worker
has changed the status of a build to failed, the runner ignores this transition and does not kill the build.
Steps to reproduce
- Create a long running build (requiring more than 48h to complete)
- Set the build timeout to something high, e.g. 10 days
- launch the build
Expected behavior
- The build should complete as normal after 48h+.
- The stuck_ci_builds_worker should not mark running builds as failed at any point while it is running.
- The runner should understand that a build is marked as failed on the CI and proceed to kill it if running.
Actual behavior
- On the midnight the day after you start the build, it will be marked as 'failed' by the stuck_ci_builds_worker.
- While the build may be marked as failed, the runner will continue to be processing the build and even sending logs to CI as normal.
Relevant logs and/or screenshots
Please see previous description of the issue and further comments below.
Possible fixes
Previous description of this issue:
I use gitlab CI to start a job which takes a long time to run (up to 4 days). It starts a python script which iterates over a large number of files and performs various operations on each, then at the end it returns a 0 or 1 exit code and exists. My YML file is similar to this:
soak_test:
stage: soak_test
script:
- python3 ./src/run_soak.py
tags:
- soak
The build timeout in runner settings is set to 10 days and the output size is also set very high.
The problem is this:
The job starts running, and everything is fine for a while (for the first couple of days). However, after an amount of time (usually more than 2000 minutes in), the job suddenly turns to 'failed' in Gitlab's builds view.
However, the job is still really running!. The python script started by gitlab-runner is still running and processing of the files continues as normal. Gitlab's CI viewer even gets updated with new logs, while the build is marked as 'failed'. The runner is sending logs as usual.
I managed to pull this out of the runner logs set in -debug mode:
time="2016-09-11T00:56:44+01:00" build=4558 build-log=0-128275 build-status=running sent-log=127991-128275 status="202 Accepted"
time="2016-09-11T00:57:14+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T00:57:44+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T00:58:14+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T00:58:44+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T00:59:15+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T00:59:45+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:00:15+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
**time="2016-09-11T01:00:45+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"**
time="2016-09-11T01:01:15+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:01:45+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:02:15+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:02:45+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:03:15+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:03:45+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:04:15+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:04:45+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:05:15+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:05:45+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:06:15+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:06:33+01:00" build=4558 build-log=0-128369 build-status=failed sent-log=128275-128369 status="202 Accepted"
time="2016-09-11T01:06:36+01:00" build=4558 build-log=0-128597 build-status=failed sent-log=128369-128597 status="202 Accepted"
time="2016-09-11T01:07:06+01:00" build=4558 build-log=0-128597 build-status=failed sent-log=128597-128597 status="202 Accepted"
time="2016-09-11T01:07:36+01:00" build=4558 build-log=0-128597 build-status=failed sent-log=128597-128597 status="202 Accepted"
As you can see, at some point the worker sees a change of the remote build-status from running
to failed
without any reason, as the job is still going on.
This job is executed on a runner configured with a shell executor on an Ubuntu 16.04 machine. An inspection of running processes on the build server confirms that nothing has exited:
root 32291 0.4 0.0 46268 10156 ? Ssl Sep05 38:31 /usr/bin/gitlab-ci-multi-runner --log-level debug --debug run --working-directory /ho
root 5592 0.0 0.0 56416 3368 ? S Sep09 0:00 \_ su -s /bin/bash gitlab-runner -c bash --login
gitlab-+ 5597 0.0 0.0 12600 2032 ? Ss Sep09 0:00 \_ bash --login
gitlab-+ 5599 0.0 0.0 12640 2172 ? S Sep09 0:00 \_ bash --login
gitlab-+ 26759 0.0 0.2 142904 54968 ? S 01:45 0:14 \_ python3 ./src/run_loop.py
gitlab-+ 16409 0.7 0.1 127640 32124 ? S 18:42 0:07 \_ python3 /home/gitlab-runner/builds/0e61934f/0/myscript/
gitlab-+ 16412 112 35.8 20773520 7371612 ? Sl 18:42 17:56 \_ /usr/bin/java -jar myjarscript.jar