Skip synchronization updates for assignment timeouts (!1053) · Merge requests · BuildGrid / buildgrid

Cal Pratt requested to merge cpratt34/assignment-timeout-handling into master Apr 19, 2024

This is similar to the issues being mitigated in !1050 (merged), but it will also help in other cases where the bot name has not been updated.

We are still seeing synchronization calls where the session data in the request is missing and does not match the database records. In the previous case, this was due to stale update-bot-session requests lingering after a new create-bot-session request is made. This MR addresses the case where a stale update-bot-session request lingers after a new update-bot-session request is made for the same bot name/id pair.

My best theory as to why this is happening is that the cancellation callback from grpcio is not being invoked, leaving update-bot-session requests waiting on job assignments in two separate scheduler instances. This could be due to the grpcio libraries or to proxy configurations not properly forwarding cancellations.

When this happens, only a single scheduler will perform an assignment and properly wake its waiting update-bot-session call. When the winning scheduler assigns a job, it is returned to the worker immediately, and the worker will quickly ack the request, moving the lease into the active state. Eventually a timeout occurs in the other, stale request. If we enter the synchronization at that point, after the timeout, we will see an active job in the database but nothing in the bot session supplied by the request.

If we skip calling synchronize bot session after a timeout occurs, we have two cases the next time a new UpdateBotSession comes in for these requests:

1. There are two competing worker processes for the same bot session.
   - The session will be cancelled as is desired, because of the invalid state between request and database.
2. There is only one worker process attempting to make updates.
   - The response from the stale entry is never acknowledged, and we happily ignore the request.
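
To make the two cases concrete, here is a minimal sketch of the control flow, not the actual BuildGrid code. `BotRecord`, `synchronize`, and `finish_update` are hypothetical names invented for illustration, and the database is reduced to a single in-memory object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BotRecord:
    """Hypothetical stand-in for the database view of one bot session."""
    active_lease: Optional[str] = None
    cancelled: bool = False


def synchronize(record: BotRecord, request_lease: Optional[str]) -> None:
    """Hypothetical synchronization step: a request whose lease state
    disagrees with the database invalidates the session."""
    if record.active_lease != request_lease:
        record.cancelled = True


def finish_update(record: BotRecord, request_lease: Optional[str],
                  timed_out: bool) -> str:
    """Finish an update-bot-session call once its wait for a job ends."""
    if timed_out:
        # The wait ended without this scheduler assigning anything, so the
        # request may be stale. Leave the database untouched.
        return "skipped"
    synchronize(record, request_lease)
    return "cancelled" if record.cancelled else "ok"


# The winning scheduler assigned a job and the worker acked it.
record = BotRecord(active_lease="job-1")

# The stale request, which still carries no lease, times out afterwards.
stale_result = finish_update(record, request_lease=None, timed_out=True)
```

In this simplified model the stale timeout leaves `record` alone. A later request that carries the matching lease synchronizes normally, while a competing worker that still reports no lease trips the mismatch check and the session is cancelled.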

One consequence of this change is that, for legitimately waiting requests, the expiry time will no longer be updated by synchronize_bot_lease. To deal with this, I have reworked _assign_deadline_for_botsession to redundantly perform this update as well. This avoids the complexity of making the full synchronize_bot_lease call in cases where we cannot be confident that there are no concurrent requests ongoing in a different scheduler. The only downside is that it may take slightly longer to reap a legitimately expired bot session.
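
As a rough illustration of refreshing the expiry while assigning the deadline, the sketch below uses hypothetical names (`BotRecord`, `assign_deadline`, `grace`) and assumes the expiry is simply set to the wait deadline plus a grace period. The real _assign_deadline_for_botsession may compute it differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class BotRecord:
    """Hypothetical stand-in for the stored bot session row."""
    expiry_time: datetime


def assign_deadline(record: BotRecord, now: datetime,
                    max_wait: timedelta, grace: timedelta) -> datetime:
    """Pick the wait deadline and refresh the stored expiry alongside it."""
    deadline = now + max_wait
    # Assumption: the session should outlive the wait by a grace period,
    # so a legitimately waiting bot is not reaped mid-request.
    record.expiry_time = deadline + grace
    return deadline


now = datetime(2024, 4, 19, 12, 0, tzinfo=timezone.utc)
record = BotRecord(expiry_time=now)  # session is about to expire
deadline = assign_deadline(record, now,
                           max_wait=timedelta(seconds=30),
                           grace=timedelta(seconds=10))
```

Under this assumption a stale request would also push the expiry forward, which is one way the slightly slower reaping described above could arise.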

This MR also adds improved logging 😄
