Skip to content

Retry lease if worker reported a non-OK lease status

Jeremiah Bonney requested to merge jbonney/retry-non-ok-lease-status into master

Before raising this MR, consider whether the following are required, and complete if so:

  • Unit tests
  • Metrics
  • Documentation update(s)

If not required, please explain in brief why not.

Description

This PR adds support for retrying leases that a have certain non-OK status set in the Lease. This is especially useful for the pending support for buildbox-worker to communicate it needs to stop work due to a SIGTERM, as it will allow that lease to be rescheduled seamlessly. To make this happen I moved update_job_lease_state in scheduler.py to retry a job if the lease state was COMPLETED but it's status was not OK. I also moved a bit of code relating to deleting job leases from bots/instance.py to the scheduler, making things a bit cleaner and removing an inconsistency between the in-mem scheduler and the sql scheduler.

I thought about having a list of acceptable status codes for retry configurable somewhere, but as we already have a retry limit I felt just retrying all of them to start would be useful. I've configured a default list of non-OK statuses to retry instead of all of them, specifically INTERNAL and UNAVAILABLE. I was also thinking about putting ABORTED here, but up for thoughts on what should be retryable by default and what we should treat as a final error.

While working tests for this change, I also found a somewhat unrelated issue where a new Lease was created for each retry a job underwent, resulting in jobs having more than one lease internally. I fixed this by updating _create_lease to essentially do an update or insert, depending on if a lease already exists for the job or not.

Validation

I've added extra tests for this behavior as well as a new one relating to a bot marking a lease as COMPLETED with an OK status + ActionResult.

Edited by Jeremiah Bonney

Merge request reports