Fix potential state inconsistency between Lease and Job

Zehao Chen requested to merge zchen723/fix-lease-job-inconsistency into master

Before raising this MR, consider whether the following are required, and complete if so:

  • Unit tests
  • Metrics
  • Documentation update(s)

If not required, please explain in brief why not.

Description

This MR aims to fix a gap that potentially leads to state inconsistency between a Job and its Lease.

When a worker completes a lease, it calls UpdateBotSession and the server does the following:

  1. Update the lease in DB (and some fields of job)
  2. Delete the lease in DB
  3. Update the job in DB

The details can be found here: https://gitlab.com/BuildGrid/buildgrid/-/blob/20b3319aa0027deaa58ac441a83a859ab1258092/buildgrid/server/scheduler.py#L433-L451

The queries above are not executed in a single DB transaction, so an error partway through can leave the lease and the job in inconsistent states, with the job stuck in the Executing state forever.
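
To illustrate the failure mode, here is a minimal sketch that mimics the three steps with an in-memory store. It is not the BuildGrid scheduler code: `InMemoryStore`, `complete_lease`, and the `fail_on_job_update` flag are hypothetical names used only for this example, and each method call stands in for a separately committed query.

```python
class InMemoryStore:
    """Hypothetical stand-in for the DB; each method acts as its own transaction."""

    def __init__(self):
        self.leases = {"job-1": "ACTIVE"}
        self.jobs = {"job-1": "EXECUTING"}

    def update_lease(self, job_name, state):
        self.leases[job_name] = state

    def delete_lease(self, job_name):
        del self.leases[job_name]

    def update_job(self, job_name, state, fail=False):
        if fail:
            # Simulates a DB error that happens after the lease is already gone.
            raise ConnectionError("simulated DB error")
        self.jobs[job_name] = state


def complete_lease(store, job_name, fail_on_job_update=False):
    store.update_lease(job_name, "COMPLETED")  # step 1
    store.delete_lease(job_name)               # step 2
    store.update_job(job_name, "COMPLETED", fail=fail_on_job_update)  # step 3


store = InMemoryStore()
try:
    complete_lease(store, "job-1", fail_on_job_update=True)
except ConnectionError:
    pass

print(store.leases)  # the lease has been deleted
print(store.jobs)    # the job is still marked EXECUTING
```

In this sketch, steps 1 and 2 take effect while step 3 fails, so the job remains in EXECUTING with no lease left to complete it.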

This MR allows a job to be marked as COMPLETED even if its lease is already COMPLETED, which mitigates the problem above when the worker retries UpdateBotSession.
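
A rough sketch of the relaxed behaviour, assuming a simplified in-memory model rather than the real scheduler: `JobState`, `mark_job_completed`, and the `jobs` / `active_leases` dictionaries are hypothetical and exist only for illustration. The point is that completing the job does not require an active lease, so a retried call can still move the job forward.

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "QUEUED"
    EXECUTING = "EXECUTING"
    COMPLETED = "COMPLETED"


def mark_job_completed(jobs, active_leases, job_name):
    """Mark a job COMPLETED whether or not it still has an active lease.

    Returns True if the job state changed, False if it was already COMPLETED.
    """
    if job_name not in jobs:
        raise KeyError(f"unknown job: {job_name}")
    if jobs[job_name] is JobState.COMPLETED:
        return False  # a repeated retry is a no-op
    # Tolerate a lease that an earlier, partially failed attempt already removed.
    active_leases.pop(job_name, None)
    jobs[job_name] = JobState.COMPLETED
    return True


# State left behind by a partially failed first attempt: no lease, job EXECUTING.
jobs = {"job-1": JobState.EXECUTING}
active_leases = {}

first = mark_job_completed(jobs, active_leases, "job-1")
second = mark_job_completed(jobs, active_leases, "job-1")
```

Making the operation idempotent in this way means the worker's retry repairs the stuck job instead of being rejected for lacking an active lease.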

Changes proposed in this merge request:

  • Allow a job to be COMPLETED even if there's no active lease.
Edited by Zehao Chen
