CANCELLED lease state is never sent to workers
Context
When an operation is cancelled, the Bots service should set the corresponding lease's state to CANCELLED
to tell the bot to cancel the job. However, the BuildGrid scheduler removes the lease during the next UpdateBotSession
call. From the point of view of the worker / bot, the lease is dropped instead of being cancelled.
It might be acceptable for the worker to cancel the job when it sees a known lease disappear, but this also potentially makes tracing and debugging more difficult.
Root cause
The following lines are always dead code:
    # If the lease was marked cancelled on the buildgrid side
    # inform the bot (update lease, no need to update bgd datastore)
    if current_lease.state == LeaseState.CANCELLED.value:
        lease.state = LeaseState.CANCELLED.value
        return (lease, False)
because of the earlier check, which returns as soon as no active lease is found:
    current_lease = self._scheduler.get_job_lease(lease.id)
    # get_job_lease will only return active leases in sql
    # data-store, so handle if no lease was returned
    if current_lease is None:
        return (None, False)
which can be verified in the lease lookup, where only active leases are considered:
    lease = self.active_leases[0].to_protobuf() if self.active_leases else None
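The unreachable branch can be illustrated with a small standalone model. This is only a sketch, not BuildGrid code: the `Lease` and `Job` classes, the `get_lease` and `check_lease` names, and the assumption that a cancelled lease is excluded from `active_leases` are all hypothetical stand-ins for the behaviour described above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class LeaseState(Enum):
    # Illustrative values only; the real enum comes from the bots protobuf.
    PENDING = 1
    ACTIVE = 2
    COMPLETED = 4
    CANCELLED = 5


@dataclass
class Lease:
    id: str
    state: int


@dataclass
class Job:
    # Hypothetical stand-in for the job record held by the scheduler.
    leases: List[Lease] = field(default_factory=list)

    @property
    def active_leases(self) -> List[Lease]:
        # Assumption: a cancelled lease no longer counts as active.
        return [l for l in self.leases if l.state != LeaseState.CANCELLED.value]

    def get_lease(self) -> Optional[Lease]:
        # Mirrors the "first active lease or None" lookup quoted above.
        return self.active_leases[0] if self.active_leases else None


def check_lease(job: Job, lease: Lease) -> Tuple[Optional[Lease], bool]:
    current_lease = job.get_lease()
    if current_lease is None:
        # A cancelled lease always ends up here, so the lease is dropped.
        return (None, False)
    if current_lease.state == LeaseState.CANCELLED.value:
        # Never reached: the lookup cannot return a cancelled lease.
        lease.state = LeaseState.CANCELLED.value
        return (lease, False)
    return (lease, False)


cancelled_job = Job(leases=[Lease("lease-1", LeaseState.CANCELLED.value)])
bot_view = Lease("lease-1", LeaseState.ACTIVE.value)
result = check_lease(cancelled_job, bot_view)
```

In this model `result` is `(None, False)` and `bot_view.state` is never changed, which matches the worker seeing the lease disappear rather than become CANCELLED.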
Expected behaviour
From the Remote Workers API (RWAPI):
CANCELLED: at any time, the service may change the state of a lease from PENDING or ACTIVE to CANCELLED;
the bot may not change to this state. The service then waits for the bot to acknowledge the change
by updating its own status to CANCELLED as well. Once both the service and the bot agree,
the service may remove it from the list of leases.
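The handshake quoted above can be sketched as a pure function over lease lists. This is a simplified model under stated assumptions, not the BuildGrid implementation: `Lease`, `leases_to_send`, and the enum values are hypothetical names introduced for illustration, and each call stands for one UpdateBotSession round.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class LeaseState(Enum):
    # Illustrative values only; the real enum comes from the bots protobuf.
    PENDING = 1
    ACTIVE = 2
    COMPLETED = 4
    CANCELLED = 5


@dataclass
class Lease:
    id: str
    state: LeaseState


def leases_to_send(service_leases: List[Lease], bot_leases: List[Lease]) -> List[Lease]:
    """Return the leases the service would send back in one update round."""
    bot_states = {lease.id: lease.state for lease in bot_leases}
    response = []
    for lease in service_leases:
        if lease.state is LeaseState.CANCELLED:
            if bot_states.get(lease.id) is LeaseState.CANCELLED:
                # Both sides agree, so the lease may now be removed.
                continue
            # The bot has not acknowledged yet: keep sending CANCELLED.
            response.append(Lease(lease.id, LeaseState.CANCELLED))
        else:
            response.append(lease)
    return response


service_side = [Lease("lease-1", LeaseState.CANCELLED)]

# Round 1: the bot still believes the lease is ACTIVE.
round_1 = leases_to_send(service_side, [Lease("lease-1", LeaseState.ACTIVE)])

# Round 2: the bot has acknowledged by reporting CANCELLED itself.
round_2 = leases_to_send(service_side, [Lease("lease-1", LeaseState.CANCELLED)])
```

Here `round_1` still carries the lease in the CANCELLED state, and `round_2` is empty only after the bot's acknowledgement. The current behaviour corresponds to returning an empty list already in the first round.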
Current behaviour
The lease is dropped during the UpdateBotSession call.
Steps to reproduce
- execute a long-running command such as sleep 3600
- cancel the operation
- observe the log
Relevant Log / Screenshot
Server-side
buildgrid_1 | 2022-12-29 20:06:56,014:[ buildgrid.server.job][DEBUG][gRPC_Executor_0]: Lease cancelled for job [6aa97a28-cf76-4d53-a2aa-3e86ee179bb3]: [6aa97a28-cf76-4d53-a2aa-3e86ee179bb3]
buildgrid_1 | 2022-12-29 20:06:56,014:[ buildgrid.server.job][DEBUG][gRPC_Executor_0]: State changed for job [6aa97a28-cf76-4d53-a2aa-3e86ee179bb3]: [CANCELLED] (lease)
...
buildgrid_1 | 2022-12-29 20:06:58,125:[ buildgrid.server.bots.instance][DEBUG][gRPC_Executor_1]: Removed lease id=[6aa97a28-cf76-4d53-a2aa-3e86ee179bb3] from bot=[/401c8afb-45ab-4e68-b610-b644b4b43eb1]
Worker-side
worker_1 | 2022-12-29T20:07:08.139+0000 [1:140586784218944] [buildboxworker_worker.cpp:579] [DEBUG] Lease [6aa97a28-cf76-4d53-a2aa-3e86ee179bb3] was tracked locally but not by the server, removing.
buildgrid_1 | 2022-12-29 20:06:58,125:[ buildgrid.server.bots.instance][DEBUG][gRPC_Executor_1]: Sending BotSession update for name=[/401c8afb-45ab-4e68-b610-b644b4b43eb1], bot_id=[test_worker]: leases=[].
Acceptance Criteria
The lease's state is set to CANCELLED from the server, the worker acknowledges this, then both sides remove the lease.
Related Issue
This issue describes a similar but different bug - here, the lease is removed without ack instead of being cleared.