Fix potential state inconsistency between Lease and Job
Before raising this MR, consider whether the following are required, and complete if so:
- Unit tests
- Metrics
- Documentation update(s)
If not required, please explain in brief why not.
Description
This MR aims to fix a gap that potentially leads to state inconsistency between a Job and its Lease.
When a worker completes a lease, it calls UpdateBotSession, and the server does the following:
- Update the lease in DB (and some fields of job)
- Delete the lease in DB
- Update the job in DB
The details can be found here: https://gitlab.com/BuildGrid/buildgrid/-/blob/20b3319aa0027deaa58ac441a83a859ab1258092/buildgrid/server/scheduler.py#L433-L451
The queries above are not executed in a single DB transaction, so an error partway through can leave the lease and the job in inconsistent states, with the job stuck in the Executing state forever.
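
The failure mode can be illustrated with a minimal, self-contained sketch. This is not BuildGrid's scheduler code: `FakeDB`, `complete_lease`, and the `State` enum are hypothetical names, and in-memory dictionaries stand in for the database, with each method call treated as a separately committed write.

```python
from enum import Enum


class State(Enum):
    EXECUTING = "EXECUTING"
    COMPLETED = "COMPLETED"


class FakeDB:
    """Hypothetical in-memory stand-in for the DB; every call commits on its own."""

    def __init__(self):
        self.jobs = {"job-1": State.EXECUTING}
        self.leases = {"job-1": State.EXECUTING}
        self.fail_on_job_update = False

    def update_lease(self, job_name, state):
        self.leases[job_name] = state

    def delete_lease(self, job_name):
        del self.leases[job_name]

    def update_job(self, job_name, state):
        if self.fail_on_job_update:
            raise ConnectionError("simulated DB error")
        self.jobs[job_name] = state


def complete_lease(db, job_name):
    # Three independent writes that do not share a transaction.
    db.update_lease(job_name, State.COMPLETED)
    db.delete_lease(job_name)
    db.update_job(job_name, State.COMPLETED)  # may fail after the lease is gone


db = FakeDB()
db.fail_on_job_update = True
try:
    complete_lease(db, "job-1")
except ConnectionError:
    pass  # the worker would see an error and retry later

# The lease has been deleted, but the job was never moved out of EXECUTING.
print("lease present:", "job-1" in db.leases)
print("job state:", db.jobs["job-1"].value)
```

In this sketch the third write fails after the first two have already taken effect, which is the window in which the job and its lease can diverge.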
This MR allows a job to be marked as COMPLETED even if its lease is already COMPLETED, which mitigates the problem above when the worker retries UpdateBotSession.
Changes proposed in this merge request:
- Allow a job to be COMPLETED even if there's no active lease.
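
A rough sketch of the intended behaviour, under stated assumptions: `complete_job`, the `State` enum, and the dictionary-based storage are hypothetical and only illustrate the idea of tolerating a missing or already-completed lease. The actual change lives in the scheduler and may differ in detail.

```python
from enum import Enum
from typing import Dict


class State(Enum):
    EXECUTING = "EXECUTING"
    COMPLETED = "COMPLETED"


def complete_job(jobs: Dict[str, State], leases: Dict[str, State], job_name: str) -> None:
    """Mark a job COMPLETED whether or not an active lease still exists."""
    if job_name not in jobs:
        raise KeyError(f"unknown job: {job_name}")

    # The lease may already be completed or deleted by an earlier,
    # partially failed attempt; pop() with a default tolerates both cases.
    leases.pop(job_name, None)

    # Completing the job no longer depends on finding an active lease.
    jobs[job_name] = State.COMPLETED


# State left behind by a failed first attempt: lease gone, job still EXECUTING.
jobs = {"job-1": State.EXECUTING}
leases: Dict[str, State] = {}

complete_job(jobs, leases, "job-1")  # the worker's retry now succeeds
complete_job(jobs, leases, "job-1")  # a further retry is harmless

print("job state:", jobs["job-1"].value)
```

Making the completion step idempotent in this way means a retried UpdateBotSession can repair the job state instead of being rejected because the lease is no longer active.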