Enforce job timeout on server side

Job timeouts are currently enforced entirely in/by the client. If for e.g. a job is started and the client disappears, no timeout will be enforced and the job/step will run to completion.

This MR changes that by enforcing the timeout on the server side. The original approach to do this, described in Enforce Job Timeouts on the Server (#299 - closed), was to add a timeout field to the RunRequest struct, and create a context WithDeadline or WithTimeout with that deadline/timeout in Job.New.

In discussion @cam_swords and I talked about just using the context used to make the RunRequest as the job lifecycle context. With this approach, the caller (runner) would create a WithDeadline context with the deadline being the CI-configured job timeout, and then use that context to make the Run request. This is very slick from the caller's POV, but it does present one potential danger. The timeout/deadline for the Run request itself now becomes the CI job timeout, which means the Run request could hang for the entirety of the CI job timeout.

I tried this approach, but alas it does not work because gRPC cancels API request contexts at the end of the request. This would cause Job execution to be aborted when the Run request ended, which is not what we want.

The next best thing we can do is to do the same from the caller side, but on the server side extract the deadline from the passed context, and use it to create a new context. Basically we're smuggling the CI job timeout in the Run request context. This works and is what i have implemented here.

The Job.Ctx is now a context WithDeadline, where the deadline value is based on the context used in the Run API, which is in turn set by the caller (a.k.a. runner). Note that if the client (i.e. runner) does not set the deadline, gRPC WILL set a default deadline of 5s. Also note that runner is currently NOT setting this value, and to handle backward compatibility, we have to set a default/fallback value, which I've set to one hour.

For testing purpose we want to be able to set the default/fallback to arbitrary values, hence the WithJobDeadline function options.

After further discussion, I've gone ahead with the original implementation of adding a Timeout field to the RunRequest.

I've included one unit tests at the client level, and one (arguably weaker) at the Job level.

Edited by Axel von Bertoldi

Merge request reports

Loading