Skip to content

Remove existing worker if already registered in register_worker

Jeremiah Bonney requested to merge jbonney/jobassigner-key-error into master

Before raising this MR, consider whether the following are required, and complete if so:

  • Unit tests
  • Metrics
  • Documentation update(s)

If not required, please explain in brief why not.

Description

For the JobAssignmentThread we keep a mapping of Worker, an object we use to coordinate assignment of work between the assignment thread and waiting grpc requests, to bot_session.name, a unique name given to each bot session on creation. However by using bot_session.name, things get strange if two requests are being processed for the same bot_session.name. This shouldn't happen, but it can if a bot cancels an in-flight UpdateBotSession request and then sends a follow up UpdateBotSession. In this case, the "cancelled" UpdateBotSession is still processing in BuildGrid when the new UpdateBotSession comes in. This causes the JobAssignmentThread to get very confused, leading to KeyError being thrown due to a worker being removed from the worker_map.

This PR addresses this by explicitly removing the worker in register_worker if a worker already exists for this name. This makes it so various combinations of register_worker -> remove_worker when used from multiple threads with the same bot name doesn't cause the JobAssignmentThread to get in a bad state. It's not perfect though, as it has some unfortunate behavior where a worker may be prematurely removed. Consider the following two request r1/r2

r1 -> register_worker(foo)
r1 -> wait_for_work() # blocking until event set or times out
r2-> register_worker(foo) # Replaces r1
r1 -> wait_for_work() times out
r1 -> remove_worker(foo) # foo is removed so no work will be assigned to r2

Having r2 not get any work and wait around until the ttl expires isn't great, but it is much better than the current behavior of the JobAssignmentThread breaking. I couldn't come up with a cleaner way to solve this problem, and definitely happy to hear other thoughts.

Validation

Added a test to the job assigner to verify that calling register_worker with the same name twice doesn't cause things to break horribly.

Edited by Jeremiah Bonney

Merge request reports