Skip to content

Always replace first worker in job_assigner instead of erroring out

Jeremiah Bonney requested to merge jbonney/job-assigner-concurrent-rpc into master

Description

This PR adjusts the behavior added in !821 (merged) around what happens if the same worker appears to have multiple UpdateBotSession requests. The current behavior is to allow the second request to replace the first request in the job_assigner metadata, but only if the context is not active. If it is active, the second request is returned an error about the already in-flight request. This is great in theory, however context.is_active ends up not being super reliable in practice.

In the case of when buildbox-worker receives a signal in the middle of an active UpdateBotSession call, the ongoing RPC is cancelled and potentially a new UpdateBotSession request submitted. Sadly, this cancellation is best effort and not guaranteed to succeed, and I've personally seen increased rates of the cancellation not being noticed by BuildGrid by the time the second UpdateBotSession is processed. I suspect this behavior varies a lot on environment, having haproxy in the middle is probably making this worse in my case, but the current behavior isn't great when it does happen.

The simple fix is to unconditionally replace the initial request when a new one comes in. This gets rid of this error entirely and is consistent behavior-wise with CreateBotSession, where older BotSessions are closed and the newest one is considered active.

Issues addressed

Addresses #402 (closed)

Merge request reports