Skip to content

Handle bad/stale bots reporting to server and stealing work

Ed Baunton requested to merge edbaunton/buildgrid:quarantine into master

Description

Reviewing the specification of the protocol at https://gitlab.com/BuildGrid/buildgrid/blob/70784f40dcee549b04a7803072c446b628cc42db/google/devtools/remoteworkers/v1test2/bots.proto#L116-120 it appears that we do not currently correctly handle bots reporting themselves to the server that have been succeeded by another instance with a different id but the same name. This change effectively quarantines those bots. Currently we do not tell the bot to cancel any work it is doing since the protocol specifies that an error must be returned; we probably should update the bots to relinquish all work once it receives an error response. When an error response is delivered all work is requeued.

From the commit:

If CreateBotSession is called twice with the same bot name (which is
generated by the server) then the last should win. All other bots should
not be given work.

This changeset addresses the following scenarios in relation to the above:

1. There is the case where an update is received by the server from
   a bot that is not registered on the server: the bot should simply
   get an error response (that functionality already existed).

2. The case whereby an update is received from a bot whose name does
   not match that which we have on file (maybe it is an old
   instance) that hasn't been come down properly: it should not
   supercede the state stored by the bot with the same name that
   took over its role. Again, we return an error response.

3. The third and final case is to find any bots that have the bot_id
   provided by the updating bot that we are already managing on the
   server.  For example bot 1 creates and updates with id foo and
   bot 2 updates with id foo, we must reject any updates from bot 2
   since they have a mismatched name (most likely a bug).

There is also a minor commit to fix a naming issue in the worker exceptions I discovered while testing.

Edited by Ed Baunton

Merge request reports