
Cancellations incorrectly clear the lease instead of updating its status

Context

When an operation is cancelled, the Bots service should set the corresponding lease's state to CANCELLED to tell the bot to cancel the job. However, BuildGrid currently "clears" the lease instead, removing all property values and leaving an empty message. This doesn't break BuildGrid's own bot, which explicitly checks for this condition, but reccworker echoes the invalid lease back to BuildGrid, which confuses it further and prevents it from issuing subsequent jobs to that worker.
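To make the difference between the two behaviours concrete, here is a minimal Python sketch. It models a lease as a plain dataclass rather than the real protobuf message: `Lease` and `LeaseState` are simplified stand-ins (the member names follow the Remote Workers API as I understand it, the numeric values are illustrative only), and `worker_should_cancel` is a hypothetical helper, not reccworker's actual logic.

```python
from dataclasses import dataclass
from enum import Enum


class LeaseState(Enum):
    # Member names follow the Remote Workers API; values are illustrative.
    UNSPECIFIED = 0
    PENDING = 1
    ACTIVE = 2
    COMPLETED = 3
    CANCELLED = 4


@dataclass
class Lease:
    # Simplified stand-in for the protobuf Lease message.
    id: str = ""
    state: LeaseState = LeaseState.UNSPECIFIED


def clear(lease: Lease) -> None:
    """Mimic clearing a protobuf message: every field returns to its default."""
    lease.id = ""
    lease.state = LeaseState.UNSPECIFIED


def cancel(lease: Lease) -> None:
    """What this issue expects: keep the lease identifiable, change only its state."""
    lease.state = LeaseState.CANCELLED


def worker_should_cancel(lease: Lease) -> bool:
    """Hypothetical worker-side check that only recognises an explicit CANCELLED."""
    return lease.state is LeaseState.CANCELLED


cleared = Lease(id="lease-1", state=LeaseState.ACTIVE)
clear(cleared)

cancelled = Lease(id="lease-1", state=LeaseState.ACTIVE)
cancel(cancelled)

print(cleared)    # an empty lease: no id, unspecified state
print(cancelled)  # same id, state CANCELLED
print(worker_should_cancel(cleared), worker_should_cancel(cancelled))
```

A worker that only looks for an explicit CANCELLED state has no way to tell which job an empty lease referred to, which is consistent with reccworker simply echoing it back.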

Steps to reproduce

  • Optionally, add logging statements to reccworker to log the sent/received bot session.
  • Start a BuildGrid server and run reccworker.
  • Start a long-running job with bgd execute command [some directory] -- sleep 100.
  • Cancel the job with bgd operation cancel [operation id].
  • Attempt to run a new job with bgd execute command [some directory] -- echo hello.

Expected result

  • The lease's state is set to CANCELLED, the worker acknowledges this, then the job is removed.
  • The new job is executed.

Actual result

  • The lease is "cleared" (replaced with a completely empty Lease message) and never removed.
  • The new job stays queued forever.

Task Description

This bit of BotsInterface.update_bot_session is suspect -- it looks like the lease could be getting cleared if BotsInterface._check_lease_state is incorrectly returning None. (Also, shouldn't it be deleting the lease instead of clearing it?)
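The suspected failure mode can be sketched as follows. This is not BuildGrid's code: `check_lease_state`, `update_lease_suspected` and `update_lease_expected` are hypothetical stand-ins for `BotsInterface._check_lease_state` and the relevant part of `BotsInterface.update_bot_session`, written only to show how a `None` return combined with a "clear on None" branch would produce an empty lease, and what a state-preserving alternative might look like. Leases are plain dicts here, and the set of cancelled lease IDs is an assumption made for the example.

```python
from typing import Optional

CANCELLED = "CANCELLED"


def check_lease_state(lease: dict, cancelled_ids: set) -> Optional[dict]:
    """Hypothetical check: returns None when the lease's job was cancelled."""
    if lease["id"] in cancelled_ids:
        return None
    return lease


def update_lease_suspected(lease: dict, cancelled_ids: set) -> dict:
    """Suspected behaviour: a None result causes the lease to be cleared."""
    checked = check_lease_state(lease, cancelled_ids)
    if checked is None:
        return {}  # empty lease, analogous to a cleared protobuf message
    return checked


def update_lease_expected(lease: dict, cancelled_ids: set) -> dict:
    """Expected behaviour: keep the lease and mark it CANCELLED."""
    if lease["id"] in cancelled_ids:
        return {**lease, "state": CANCELLED}
    return lease


lease = {"id": "lease-1", "state": "ACTIVE"}
print(update_lease_suspected(lease, {"lease-1"}))
print(update_lease_expected(lease, {"lease-1"}))
```

Whether the real fix belongs in the check itself or in its caller depends on what `None` is meant to signal elsewhere in BuildGrid, which this sketch does not attempt to answer.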

Acceptance Criteria

Following the steps to reproduce above cancels the job correctly and allows a new one to be run. (Also, we should probably have a test for this.)
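As a starting point for such a test, the sketch below exercises the cancel, acknowledge, then reassign sequence against a toy in-memory fake. `FakeBotsService` and its methods are hypothetical and deliberately tiny (one worker, one lease at a time); a real test would presumably drive BuildGrid's own Bots service through its existing test fixtures instead.

```python
import unittest


class FakeBotsService:
    """Toy in-memory stand-in for the Bots service (not BuildGrid's API)."""

    def __init__(self):
        self.queue = []      # job names waiting for a worker
        self.lease = None    # the single worker's current lease, if any

    def enqueue(self, job):
        self.queue.append(job)

    def cancel(self, job):
        # Mark the active lease as CANCELLED instead of clearing it.
        if self.lease and self.lease["job"] == job:
            self.lease["state"] = "CANCELLED"

    def update_bot_session(self, worker_lease):
        # Drop the lease once the worker echoes back the cancellation.
        if worker_lease and worker_lease["state"] == "CANCELLED":
            self.lease = None
        # Hand out the next queued job when the worker is idle.
        if self.lease is None and self.queue:
            self.lease = {"job": self.queue.pop(0), "state": "ACTIVE"}
        return self.lease


class CancellationTest(unittest.TestCase):
    def test_cancel_then_run_new_job(self):
        service = FakeBotsService()
        service.enqueue("sleep 100")
        lease = dict(service.update_bot_session(None))
        self.assertEqual(lease["state"], "ACTIVE")

        service.cancel("sleep 100")
        service.enqueue("echo hello")

        # The worker learns about the cancellation on its next update...
        lease = dict(service.update_bot_session(lease))
        self.assertEqual(lease, {"job": "sleep 100", "state": "CANCELLED"})

        # ...and after it acknowledges, the new job is handed out.
        lease = service.update_bot_session(lease)
        self.assertEqual(lease, {"job": "echo hello", "state": "ACTIVE"})
```

Saved to a file, this can be run with `python -m unittest`. The same three-step shape (assign, cancel, reassign) mirrors the manual reproduction steps, so a failure here should correspond to the "new job stays queued forever" symptom.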