Retry lease if worker reported a non-OK lease status
Before raising this MR, consider whether the following are required, and complete if so:
-
Unit tests -
Metrics -
Documentation update(s)
If not required, please explain in brief why not.
Description
This PR adds support for retrying leases that a have certain non-OK status set in the Lease. This is especially useful for the pending support for buildbox-worker
to communicate it needs to stop work due to a SIGTERM
, as it will allow that lease to be rescheduled seamlessly. To make this happen I moved update_job_lease_state
in scheduler.py
to retry a job if the lease state was COMPLETED
but it's status was not OK
. I also moved a bit of code relating to deleting job leases from bots/instance.py
to the scheduler, making things a bit cleaner and removing an inconsistency between the in-mem
scheduler and the sql
scheduler.
I thought about having a list of acceptable status codes for retry configurable somewhere, but as we already have a retry limit I felt just retrying all of them to start would be useful. I've configured a default list of non-OK statuses to retry instead of all of them, specifically INTERNAL
and UNAVAILABLE
. I was also thinking about putting ABORTED
here, but up for thoughts on what should be retryable by default and what we should treat as a final error.
While working tests for this change, I also found a somewhat unrelated issue where a new Lease
was created for each retry a job underwent, resulting in jobs having more than one lease internally. I fixed this by updating _create_lease
to essentially do an update or insert, depending on if a lease already exists for the job or not.
Validation
I've added extra tests for this behavior as well as a new one relating to a bot marking a lease as COMPLETED
with an OK
status + ActionResult
.