Skip to content

Add timeout when waiting for the build to finish

Steve Xuereb requested to merge 4147-timeout-ignored-for-timedout-job into master

What does this MR do?

Add timeout when waiting for the build to finish

Why was this MR needed?

This is what happens for jobs like 304673282. If we look at the logs for 304673282 (full analysis here) we can see that the state was updated to timedout but we never see error logs like execution took longer than, we only log this in 1 place. We only update the state to timedout in 1 place. So if we see where handleError is called, it's called in 2 places, the one we are interesting in is when the context (timeout) is done/canceled. This also happens when the job is canceled and we update the state to canceled.

If the context is done, we will handle the error which in this case is creating the execution took longer than error and updating the state, but in the case of 304673282 we never are returning that error and the Job keeps running. This is because of <-buildFinish, we don't have any timeouts waiting for the any of our scripts to finish, which can be the build script, upload artifact script or cleanup script. So a process can hang forever and the build would never finish, yet the state is timeout out and we are still burning the minutes.

I think the only solution to this is to have a timeout until we wait for the build to finish, and just terminate the job altogether, the only problem we will end up with an environment that is not clean, but maybe we can fix that also.

I feel like this is only 1 case that users are facing, and not all job logs match this for example 307405134 (full analysis here) was returned a 403 when Runner tried to upload an artifact, and seem to have given up on retrying of continuing with finishing the build. I will dig into this later after I fix the first issue described above.

Are there points in the code the reviewer needs to double check?

Does this MR meet the acceptance criteria?

  • Documentation created/updated
  • Added tests for this feature/bug
  • In case of conflicts with master - branch was rebased

What are the relevant issue numbers?

closes #4147 (closed)

Edited by Steve Xuereb

Merge request reports