Tech Debt: Standardize job-retry logic
The following discussion from !468 (merged) should be addressed:
- @jbonney started a discussion: (+2 comments)
Currently we have two ways to "retry/requeue" jobs.
- delete lease: causes a new lease to be created
- update lease state to "QUEUED": causes the lease to be reassigned to a worker.
This behavior is easy to break because different places do it in different ways, especially places that check how many "retries" have been attempted (e.g. depending on how we mark the retry, we may not increment the retry count).
Let's come up with a single consistent way of doing this, e.g. on the invocation of retry_job_lease, and remove the method called delete_job_lease from the interface, which is used only for retries at the moment.
Suggestion:
- Move the lease-specific fields into the jobs table and delete the leases table. The leases table maps job_name -> job_name in a different table, and it adds an extra unique ID that requires indexing, plus an extra join, for each job+lease.
- When marking leases as "REQUEUED", just update the lease state to "QUEUED", after checking that n_tries has not been exceeded.
- Maybe invent a new lease state: "retries_exhausted", or figure out what that would look like in terms of someone requesting the results of a job which has exceeded retries.
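A rough sketch of what a single retry path could look like once the lease fields live on the job record. This is illustrative only and not taken from the codebase: the Job dataclass, its field names other than n_tries, the max_n_tries parameter, and the exact signature of retry_job_lease are all assumptions, and "retries_exhausted" is the hypothetical new state floated above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LeaseState(Enum):
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    # Hypothetical new state proposed in this issue; not an existing state.
    RETRIES_EXHAUSTED = "retries_exhausted"


@dataclass
class Job:
    # Assumed shape: lease-specific fields folded into the job record,
    # so there is no separate leases table and no extra join.
    name: str
    lease_state: LeaseState = LeaseState.QUEUED
    worker_name: Optional[str] = None
    n_tries: int = 0  # assumed to count retries performed so far


def retry_job_lease(job: Job, max_n_tries: int) -> bool:
    """Single entry point for retrying a job.

    Returns True if the job was requeued, False if retries are exhausted.
    """
    # Whatever happens next, the current worker no longer owns the lease.
    job.worker_name = None

    if job.n_tries >= max_n_tries:
        # Limit reached: park the job in a terminal state instead of requeueing.
        job.lease_state = LeaseState.RETRIES_EXHAUSTED
        return False

    # The counter is incremented in exactly one place, so every retry is counted.
    job.n_tries += 1
    job.lease_state = LeaseState.QUEUED
    return True
```

With every caller going through one function like this, there is no second code path (such as deleting the lease) that could requeue a job without incrementing n_tries.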
In addition to the above, following !479 (merged), we may want to persist leases in the database right away so that bots can talk to any available BuildGrid instance when permissive bot-session mode is enabled.