Add timeout when waiting for the build to finish
What does this MR do?
Add timeout when waiting for the build to finish
Why was this MR needed?
This is what happens for jobs like 304673282
. If we look at the logs
for 304673282
(full analysis here)
we can see that the state was updated to timedout
but we never
see error logs like execution took longer than
, we only log
this in 1 place.
We only update the state to timedout
in
1 place.
So if we see where
handleError
is called, it's called in 2 places, the one we are interesting
in is when the context (timeout) is done/canceled.
This also happens when the job is canceled
and we update the state to
canceled
.
If the context is done, we will handle the error which in this
case is creating the execution took longer than
error and
updating the state, but in the case of 304673282
we never are
returning that error and the Job keeps running. This is because of
<-buildFinish
,
we don't have any timeouts waiting for the any of our scripts to
finish, which can be the build script, upload artifact script or
cleanup script. So a process can hang forever and the build
would never finish, yet the state is timeout out and we are
still burning the minutes.
I think the only solution to this is to have a timeout until we wait for the build to finish, and just terminate the job altogether, the only problem we will end up with an environment that is not clean, but maybe we can fix that also.
I feel like this is only 1 case that users are facing, and not all job
logs match this for example 307405134
(full analysis
here)
was returned a 403 when Runner tried to upload an artifact, and seem to
have given up on retrying of continuing with finishing the build. I will
dig into this later after I fix the first issue described above.
Are there points in the code the reviewer needs to double check?
Does this MR meet the acceptance criteria?
-
Documentation created/updated -
Added tests for this feature/bug -
In case of conflicts with master
- branch was rebased
What are the relevant issue numbers?
closes #4147 (closed)