Implement Max Job Execution Timeout (lazy, upon status check)

Currently, with a single BuildGrid instance backed by a SQL data store, any job that was in the Executing state when the instance restarted remains in that state. All subsequent requests for the same action digest are deduplicated against that job, even though no bot is working on the lease anymore, so those jobs stay stuck in the Executing state forever (without any worker actually doing the work).

This MR introduces a maximum job execution timeout as a configurable option (with a default value). Before a job is deduplicated against, or before a client requests operation updates for it, that timeout is lazily checked against the job's state; if the job has been in the Executing state for longer than the configured timeout, the job, along with its relevant operations and leases, is cancelled.
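A minimal sketch of that lazy check, assuming a hypothetical `Job` model with `stage` and `worker_start_timestamp` fields; the names and the cancellation helper are illustrative, not BuildGrid's actual API:

```python
from datetime import datetime, timedelta
from enum import Enum


class OperationStage(Enum):
    QUEUED = 1
    EXECUTING = 2
    COMPLETED = 3
    CANCELLED = 4


class Job:
    # Hypothetical stand-in for the SQL-backed job record.
    def __init__(self, name, stage, worker_start_timestamp):
        self.name = name
        self.stage = stage
        self.worker_start_timestamp = worker_start_timestamp


def cancel_job_exceeding_execution_timeout(job, max_execution_timeout):
    """Cancel `job` (and, in the real scheduler, its operations and
    leases) if it has been Executing longer than `max_execution_timeout`
    seconds. Called lazily before deduplicating against the job or
    before streaming operation updates to a client."""
    if job.stage != OperationStage.EXECUTING:
        return False
    executing_for = datetime.utcnow() - job.worker_start_timestamp
    if executing_for > timedelta(seconds=max_execution_timeout):
        job.stage = OperationStage.CANCELLED
        # ...in the real implementation, also mark the job's leases and
        # operations as cancelled, so watchers receive a terminal status
        # and a new Execute request for the same action digest enqueues
        # a fresh job instead of being deduplicated against this one.
        return True
    return False


# Example: a job that started two hours ago gets cancelled under a
# one-hour (3600 s) timeout, so the next request re-queues the action.
stale = Job("job-1", OperationStage.EXECUTING,
            datetime.utcnow() - timedelta(hours=2))
assert cancel_job_exceeding_execution_timeout(stale, max_execution_timeout=3600)
```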

As a result, a new job gets queued for that action digest, and clients that had been watching an operation now receive the appropriate terminal status (e.g. after a BuildGrid restart, stale jobs are implicitly cancelled).

To Do:

  • Equivalent work for the In-Memory scheduler
  • Add tests
  • Update reference.yml and documentation

Note:

It would be nice to also have this behavior when listing operations. Since that is out of scope for this MR and may require more work and refactoring, this issue has been opened to track it: !337 (merged)
