Add support for "watching" job state

Adam Coldrick requested to merge sotk/watching-jobs into master

Description

Currently, BuildGrid has two sides, the ExecutionService and the BotsService, which communicate with each other via a shared Scheduler. A peer sends an Execute request to the ExecutionService, which creates a Job to hand out to a bot on demand, along with a message queue for sending that job's progress updates back to the peer. Meanwhile, bots request jobs from the BotsService, which hands out the job that the peer requested. When the bot reports a state change, the BotsService sends a message to the relevant peers via the message queues tracked by the Scheduler.

This works nicely for a single-scheduler BuildGrid, but it limits horizontal scaling: the bot must be connected to the same instance as every peer for a given job, otherwise this communication breaks down.

This MR removes all of that communication infrastructure, to break the dependency on sharing a Scheduler between bots and peers for a given job. This will allow us to horizontally scale the core ExecutionService and BotsService components without adversely affecting communication.

It works by replacing the message queue created by a peer with a thread which waits for changes to the job in the shared data store (i.e. the database used by all the instances). If using a PostgreSQL database, this waiting is implemented using LISTEN and NOTIFY to react as soon as changes are made. Otherwise, it relies on polling the data store at a regular interval.
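As a rough illustration of the polling fallback only (not the code in this MR), a watcher thread could look like the sketch below. `InMemoryStore`, `watch_job`, `demo`, and the stage names are hypothetical stand-ins for the shared data store and the job states; the PostgreSQL LISTEN/NOTIFY path is not shown.

```python
import threading


class InMemoryStore:
    """Hypothetical stand-in for the data store shared by all instances."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stages = {}

    def set_stage(self, job_name, stage):
        with self._lock:
            self._stages[job_name] = stage

    def get_stage(self, job_name):
        with self._lock:
            return self._stages.get(job_name)


def watch_job(store, job_name, last_seen, on_change, stop, interval=0.01):
    """Poll the store until `stop` is set, reporting each stage change seen."""
    while not stop.is_set():
        current = store.get_stage(job_name)
        if current != last_seen:
            last_seen = current
            on_change(current)
        # Event.wait doubles as a sleep that ends early when `stop` is set.
        stop.wait(interval)


def demo():
    """Watch one job from a thread and return the changes it observed."""
    store = InMemoryStore()
    store.set_stage("job-1", "QUEUED")
    seen = []
    done = threading.Event()
    stop = threading.Event()

    def on_change(stage):
        seen.append(stage)
        if stage == "COMPLETED":
            done.set()

    watcher = threading.Thread(
        target=watch_job,
        args=(store, "job-1", "QUEUED", on_change, stop),
        daemon=True,
    )
    watcher.start()
    # Simulates another instance (e.g. the one the bot is connected to)
    # writing the new state into the shared store.
    store.set_stage("job-1", "COMPLETED")
    done.wait(timeout=5)
    stop.set()
    watcher.join(timeout=5)
    return seen
```

Note that a polling watcher only sees the state present at each poll, so intermediate states that come and go between polls can be missed; reacting to notifications avoids that delay.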

The Scheduler keeps track of this thread, and also creates a threading.Event which is set and cleared when the thread detects a change in the job. Once the thread is created, the peer repeatedly waits for this event until the job is completed or cancelled, or the connection is closed. If a new peer makes a WaitExecution request for the same job, it simply waits for the same event, rather than creating a new watcher thread.

Once all connections are closed, the watcher thread stops watching for changes.
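The bookkeeping described above could be sketched roughly as follows. This is a simplified illustration under assumed names (`JobWatch`, `WatchRegistry`, and their methods are hypothetical, not BuildGrid's API), and the points where a watcher thread would be started or stopped are marked with comments rather than implemented.

```python
import threading


class JobWatch:
    """Hypothetical record for one watched job: a shared event and its peers."""

    def __init__(self):
        self.event = threading.Event()
        self.peers = set()


class WatchRegistry:
    """Hypothetical sketch of the per-job tracking the Scheduler might do."""

    def __init__(self):
        self._lock = threading.Lock()
        self._watches = {}

    def register_peer(self, job_name, peer):
        """Return the job's shared event, creating the watch for the first peer."""
        with self._lock:
            watch = self._watches.get(job_name)
            if watch is None:
                watch = JobWatch()
                self._watches[job_name] = watch
                # A real implementation would start the watcher thread here.
            watch.peers.add(peer)
            return watch.event

    def unregister_peer(self, job_name, peer):
        """Drop a peer; return True if this removed the last peer and the watch."""
        with self._lock:
            watch = self._watches.get(job_name)
            if watch is None:
                return False
            watch.peers.discard(peer)
            if watch.peers:
                return False
            del self._watches[job_name]
            # A real implementation would stop the watcher thread here.
            return True

    def is_watching(self, job_name):
        with self._lock:
            return job_name in self._watches

    def notify_change(self, job_name):
        """Called by the watcher: pulse the event, then reset it for next time."""
        with self._lock:
            watch = self._watches.get(job_name)
        if watch is not None:
            watch.event.set()
            watch.event.clear()
```

For example, two peers registering for `"job-1"` receive the same `threading.Event`, and the watch is only torn down once both have unregistered. Because the event is cleared again straight after being set, a peer would need to re-read the job state after each wake-up rather than rely on the event's flag.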

Changes proposed in this merge request:

  • Create threads to watch job state for changes
  • Remove the existing message queue based communication system
  • Wait for events emitted from the watcher threads to handle communicating back to peers
  • Modify the test suite to account for these changes

Todo

  • Update tests that are now broken
  • Use PostgreSQL LISTEN/NOTIFY to wait for changes rather than polling
  • Add some tests for parallel instantiation of the ExecutionService (this might end up in a separate MR)

This merge request, when merged, will address issue/bug:

#186 (closed)

Edited by Adam Coldrick